河源书店
RIVERHEAD BOOKS
企鹅集团出版
Published by the Penguin Group
企鹅集团(美国)有限责任公司
Penguin Group (USA) LLC
哈德逊街375号
375 Hudson Street
纽约州纽约市 10014
New York, New York 10014
美国 • 加拿大 • 英国 • 爱尔兰 • 澳大利亚 • 新西兰 • 印度 • 南非 • 中国
USA • Canada • UK • Ireland • Australia • New Zealand • India • South Africa • China
企鹅兰登书屋公司
A Penguin Random House Company
版权所有 © 2013 Erez Aiden 和 Jean-Baptiste Michel
Copyright © 2013 by Erez Aiden and Jean-Baptiste Michel
Uncharted: Big Data as a Lens on Human Culture
企鹅出版社支持版权。版权能够激发创造力,鼓励多元声音,促进言论自由,并创造充满活力的文化。感谢您购买本书的授权版本,并遵守版权法,未经许可不得以任何形式复制、扫描或分发本书的任何部分。您是在支持作家,也让企鹅出版社能够继续为每一位读者出版书籍。
Penguin supports copyright. Copyright fuels creativity, encourages diverse voices, promotes free speech, and creates a vibrant culture. Thank you for buying an authorized edition of this book and for complying with copyright laws by not reproducing, scanning, or distributing any part of it in any form without permission. You are supporting writers and allowing Penguin to continue to publish books for every reader.
ISBN 978-1-101-63211-6
尽管作者已尽力在出版时提供准确的电话号码、网址和其他联系信息,但出版商和作者均不对出版后出现的错误或变更承担任何责任。此外,出版商对作者或第三方网站及其内容不拥有任何控制权,亦不承担任何责任。
While the authors have made every effort to provide accurate telephone numbers, Internet addresses, and other contact information at the time of publication, neither the publisher nor the authors assume any responsibility for errors, or for changes that occur after publication. Further, the publisher does not have any control over and does not assume any responsibility for author or third-party websites or their content.
版本_1
Version_1
献给阿巴(Aba),
For Aba,
他始终相信我能数数
who always believed I could count
埃雷兹·艾登
EREZ AIDEN
—
—
致我的家人
To my family
让·巴蒂斯特·米歇尔
JEAN-BAPTISTE MICHEL
爱丽丝镜中奇遇记
THROUGH THE LOOKING GLASS
想象一下,如果我们拥有一个机器人,它能读遍世界各地每一座大型图书馆里每个书架上的每一本书。它会以超快的机器人速度阅读这些书籍,并凭借其绝对可靠的机器人记忆力记住读到的每一个字。我们能从这位机器人历史学家身上学到什么呢?
Imagine if we had a robot that could read every book on every shelf of every major library, all over the world. It would read these books at a super-fast robot speed and remember every single word that it had read, using its super-infallible robot memory. What could we learn from this robot historian?
举个简单的例子,每个美国人都耳熟能详。今天,我们说南方各州都是南方人。我们说北方各州都是北方人。我们说新英格兰各州都是新英格兰人。但我们却说美国到处都是公民。
Here’s a simple example that’s familiar to every American. Today, we say that the southern states are full of southerners. We say that the northern states are full of northerners. We say that the New England states are full of New Englanders. Yet we say that the United States is full of citizens.
我们为什么要用单数?这不仅仅是语法上的问题,更是关乎我们民族认同的问题。
Why do we use the singular? This is more than a fine point of grammar: It’s a matter of our national identity.
美利坚合众国成立时,其建国文献《邦联条例》确立了一个弱化的中央政府,并称这个新实体并非一个单一的国家,而是各州之间的“友好联盟”,有点类似于今天的欧盟。他们不认为自己是美国人,而是某个州的公民。
When the United States of America was established, its founding document, the Articles of Confederation, defined a weak central government, and referred to the new entity not as a single nation but instead as a “league of friendship” between individual states, somewhat akin to today’s European Union. People thought of themselves not as Americans but as citizens of a particular state.
因此,公民们在提及“the United States”时使用复数形式,这对于一群彼此独立、大体自治的州来说是恰当的。例如,约翰·亚当斯总统在1799年的国情咨文中谈到“the United States in their treaties with His Britannic Majesty”(合众国在其与英国国王陛下订立的各项条约中),用的是复数代词 their。对于今天的总统来说,这样说是不可想象的。
As such, citizens referred to “the United States” in the plural, as would be appropriate for a collection of distinct, mostly independent states. For instance, in President John Adams’ 1799 State of the Union address, he talked about “the United States in their treaties with His Britannic Majesty.” For a president to do that today would be inconceivable.
“我们人民”(《宪法》,1787 年通过)究竟是在何时真正成为“一个国家”(《效忠誓词》,1942 年通过)的?
When did “We the People” (Constitution, adopted 1787) truly become “one nation” (Pledge of Allegiance, adopted 1942)?
如果我们询问人类历史学家,他们可能会给我们指出最著名的答案,出自詹姆斯·麦克弗森著名的内战史著作《自由的呐喊》的结尾:
If we asked human historians, they would probably point us to the most famous answer, from the end of James McPherson’s celebrated Civil War history, Battle Cry of Freedom:
……这场战争的某些重大后果似乎显而易见。脱离联邦和奴隶制就此终结,在阿波马托克斯之后的一百二十五年里再未复活。这些结果标志着美国社会和政体发生了更广泛的变革,这场变革即便不是仅凭战争完成的,也以战争为标志。1861年之前,“United States”这两个词通常被当作复数名词使用:“the United States are a republic”(合众国是一个共和国)。这场战争标志着“United States”转变为单数名词。
. . . Certain large consequences of the war seem clear. Secession and slavery were killed, never to be revived during the century and a quarter since Appomattox. These results signified a broader transformation of American society and polity punctuated if not alone achieved by the war. Before 1861 the two words “United States” were generally rendered as a plural noun: “the United States are a republic.” The war marked a transition of the United States to a singular noun.
麦克弗森并非第一个提出这个建议的人;这个老生常谈的问题至少已经讨论了一百年。不妨看看1887年《华盛顿邮报》的以下摘录:
McPherson wasn’t the first to make this suggestion; this old chestnut has been discussed for at least a hundred years. Consider the following excerpt from the Washington Post in 1887:
几年前,人们曾用复数来称呼合众国。人们说“the United States are”、“the United States have”、“the United States were”。但是战争改变了这一切。在从切萨皮克湾到萨宾帕斯的火线上,这个语法问题被永远地解决了。解决它的不是威尔斯、格林或林德利·默里,而是谢里登的军刀、谢尔曼的火枪、格兰特的炮兵……戴维斯先生和李将军的投降意味着从复数到单数的转变。
There was a time a few years ago when the United States was spoken of in the plural number. Men said “the United States are”—“the United States have”—“the United States were.” But the war changed all that. Along the line of fire from the Chesapeake to Sabine Pass was settled forever the question of grammar. Not Wells, or Green, or Lindley Murray decided it, but the sabers of Sheridan, the muskets of Sherman, the artillery of Grant. . . . The surrender of Mr. Davis and Gen. Lee meant a transition from the plural to the singular.
即使一个世纪过去了,读到这个关于语言、炮火和冒险的激动人心的故事,也很难不感到激动。谁能想到一场关于语法的战争,或者一个微妙的用法问题竟然被“谢尔曼的火枪”解决了?
Even a century later, it’s hard not to get a thrill just reading this stirring tale of language, artillery, and adventure. Who could have dreamed of a war about grammar, or a subtle point of usage settled by “the muskets of Sherman”?
但我们应该相信吗?
But should we believe it?
很有可能。詹姆斯·麦克弗森是美国历史协会前主席,也是历史学家中的传奇人物。他最著名的作品《自由的呐喊》曾获得普利策奖。此外,1887年《华盛顿邮报》那篇文章的作者很可能亲身经历了这种句法上的转变,他们的目击证词再清楚不过了。
Probably. James McPherson is a former president of the American Historical Association and a legend among historians. Battle Cry of Freedom, his most famous work, won the Pulitzer. Moreover, whoever wrote that 1887 Washington Post article probably experienced this syntactic turnabout firsthand, and their eyewitness testimony couldn’t be clearer.
詹姆斯·麦克弗森虽然才华横溢,但也并非绝对可靠。目击者有时也会弄错事实。我们能不能做得更好?
Still, James McPherson, though brilliant, isn’t infallible. And eyewitnesses sometimes get the facts wrong. Is there some way that we can do better?
也许吧。假设我们要求我们的机器人——假设已经读过所有图书馆里所有书籍的机器人——贡献它的机械化意见。
Perhaps. Suppose that we ask our robot—the hypothetical robot that has read all the books in all the libraries—to contribute its mechanized opinion.
假设,为了回答我们的问题,我们乐于助人的机器人历史学家利用其惊人的记忆力绘制了下面的图表。该图表显示了“The United States is”和“The United States are”这两个短语在美国出版的英文书籍中随时间推移的使用频率。横轴代表时间的流逝,逐年递增。纵轴代表这两个短语的频率:在特定年份写成的文本中,每十亿个单词里它们平均出现多少次。例如,机器人阅读了1831年出版的书籍中出现的313,388,047个单词。在这些单词中,机器人看到了62,759次“The United States is”这个短语。这意味着该年份每十亿个单词中平均出现二十次,由对应曲线在1831年处的高度表示。
Suppose that, in response to our question, our helpful robot historian draws on its prodigious memory to make the chart that follows. The robot’s chart shows how frequently the phrases “The United States is” and “The United States are” were used over time, in English books published in the United States. Horizontally, we see the flow of time, year by year. The vertical axis shows the frequency of the two phrases: how often they appear, on average, in every billion words of text written during the year in question. For instance, the robot read 313,388,047 words that appeared in books published in the year 1831. Within those words, the robot sees the phrase “The United States is” 62,759 times. That averages out to twenty times per billion words that year, indicated by the height of the corresponding line in 1831.
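The normalization behind the chart's vertical axis can be sketched in a few lines of Python. This is only an illustrative sketch: the function names and the yearly tallies below are hypothetical and invented for the example, and they are not the robot's actual data or method.

```python
def per_billion(phrase_count: int, total_words: int) -> float:
    """Occurrences of a phrase per billion words of text in a given year."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    # Multiply first so that round numbers stay exact before the division.
    return phrase_count * 1_000_000_000 / total_words


# Hypothetical tallies, invented purely for illustration:
# year -> (singular phrase count, plural phrase count, total words read)
toy_counts = {
    1830: (4, 12, 400_000_000),
    1880: (30, 30, 1_000_000_000),
    1930: (90, 10, 2_000_000_000),
}


def frequency_series(counts):
    """Return {year: (singular_per_billion, plural_per_billion)}."""
    return {
        year: (per_billion(singular, total), per_billion(plural, total))
        for year, (singular, plural, total) in sorted(counts.items())
    }


if __name__ == "__main__":
    for year, (singular, plural) in frequency_series(toy_counts).items():
        print(year, singular, plural)
```

Dividing by the total number of words read for each year is what makes years comparable, since far more text survives from some years than from others.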
当人们开始用单数形式谈论美国时,这样的图表会让人一目了然。
A chart like this would make it completely clear when people started talking about the United States in the singular.
只有一个小问题:根据假设机器人的假设图表,我们之前讲的故事是错误的。首先,从复数到单数的转变并非瞬间发生的。它是渐进的,始于19世纪10年代,一直持续到20世纪80年代——跨越了一个半世纪以上。更重要的是,内战期间并没有出现突然的转变。事实上,战争年代与战前或战后并没有太大区别。战后确实有一些加速,但它始于李将军投降五年后。据机器人说,单数形式直到1880年,也就是战争结束十五年后,才变得更加普遍。即使在今天,州旗飘扬的邦联的复数旗帜仍然飘扬。
There’s just one small hitch: According to the hypothetical robot’s hypothetical chart, the story we were telling you before is wrong. For one thing, the transition from plural to singular was not instantaneous. It was gradual, starting in the 1810s and continuing into the 1980s—a span of more than a century and a half. More important, there was no sudden switch during the Civil War. In fact, the war years did not differ much from the years immediately before or after. There was some postbellum acceleration, but it began five years after General Lee’s surrender. According to the robot, the singular form did not become more common until 1880, fifteen years after the war. And even today, the plural banner of the state-spangled confederacy yet waves.
当然,这都是假设,因为关于一个速读机器人智胜目击者和获奖历史学家的说法实在是太牵强了。
Of course, this is all hypothetical, because this stuff about a speed-reading robot outwitting an eyewitness and a prizewinning historian is so utterly far-fetched.
但这一切都是真的。
Except that it’s all true.
麦克弗森虽然才华横溢,但在单数形式上却犯了错误。目击者并没有准确回忆事件。我们之前提到的机器人确实存在。我们刚才展示的图表就是机器人绘制的。还有数十亿张图表等待着它绘制。如今,全世界数百万人正在以一种新的方式看待历史:通过机器人的数字眼睛。
McPherson, though brilliant, was wrong about the singular form. The eyewitness didn’t recall events accurately. And the robot we were telling you about exists. And the chart we just showed you is the chart the robot drew. And there are a billion more charts it’s just waiting to draw. And today, all over the world, millions of people are seeing history in a new way: through the digital eyes of a robot.
这并不是新型镜头第一次影响我们看待世界的方式。
This is not the first time that a new kind of lens has influenced how we look at the world.
十三世纪末,一项名为眼镜的新发明开始在意大利迅速传播。几十年间,眼镜从无到有,从奇特到如今已是家常便饭。作为智能手机的先驱,眼镜是许多意大利人不可或缺的配饰,将时尚与实用融为一体,成就了可穿戴技术的早期辉煌。
In the late thirteenth century, a new invention, eyeglasses, began spreading like wildfire through Italy. In a matter of decades, glasses went from nonexistent to merely exotic to utterly commonplace. Forerunners of the smartphone, eyeglasses were an indispensable appliance for many Italians, combining fashion and function into an early triumph of wearable technology.
随着眼镜在欧洲和世界各地传播,验光成为一门大生意,制造镜片的技术也越来越好、越来越便宜。不可避免地,人们开始尝试将多个透镜组合在一起,看看效果如何。不久之后,人们意识到,只需一点工程上的改造,就能实现极高的放大倍数。复合透镜可以揭示肉眼看不见的新世界。
As eyeglasses spread across Europe and around the world, optometry became big business, and the technology for making lenses got better and cheaper. Inevitably, people began to experiment with what could be done when multiple lenses were combined. It wasn’t long before folks realized that with a little bit of engineering, they could achieve extreme magnification. Compound lenses could be made to reveal new worlds invisible to the naked eye.
例如,复合透镜可以用来放大非常小的物体。显微镜揭示了至少两个关于古老生命奥秘的惊人事实。它们表明,我们周围的动植物被细分为微小的、物理上独立的单元。罗伯特·胡克(Robert Hooke)发现了这一现象,他指出这些单元的排列方式类似于修道院的居住区,因此他称之为细胞。显微镜还揭示了微生物。这种独立的生物体通常仅由一个细胞组成,构成了生物世界的绝大部分。在显微镜发明之前,没有人知道这种生命形式可能存在。
For instance, a compound lens could be used to magnify very small things. Microscopes uncovered at least two astonishing facts about the age-old mystery of life. They showed that the animals and plants all around us are subdivided into tiny, physically separate units. Robert Hooke, who made this discovery, noted that the arrangement of these units resembled the living quarters in monasteries, which is why he called them cells. Microscopes also revealed the existence of microbes. This separate universe of organisms, often made up of only a single cell, constitutes the vast majority of the living world. Prior to the invention of the microscope, no one had any idea that such life-forms might exist.
复合透镜也可以用来放大远处的物体。伽利略带着一架放大倍数为30倍的望远镜——以现代标准来看,这不过是儿戏——探索宇宙的奥秘。无论他望向何处,望远镜都能让他看到前所未有的景象。这位佛罗伦萨科学家将望远镜指向月球——长期以来人们一直认为月球是一个完美的球体——他看到了山谷、平原和山脉,而山脉的阴影清晰可见,始终指向远离太阳的方向。在探索夜空中被称为银河的明亮带时,伽利略发现它由无数暗淡的恒星组成:也就是我们今天所说的星系。但伽利略最著名的发现是当他将望远镜指向行星时。在那里,他看到了金星的盈亏和木星的卫星,也就是名副其实的新世界。
A compound lens could also be used to magnify faraway things. Armed with a telescope capable of 30X magnification—by modern standards, a child’s plaything—Galileo tackled the mysteries of the cosmos. Wherever he looked, his telescope enabled him to see more than had ever been seen before. Pointing it at the moon—long believed to be a perfect sphere—the Florentine scientist saw valleys, plains, and mountains, the latter with distinct shadows that always pointed away from the sun. Exploring the bright band across the night sky called the Milky Way, Galileo could see that it consisted of stars, faint and innumerable: what today we call a galaxy. But Galileo’s most famous discoveries came when he pointed his telescope at the planets. There he saw the phases of Venus and the moons of Jupiter, new worlds in the most literal sense.
伽利略的观测成为推翻托勒密学说的决定性证据,该学说认为地球静止不动,位于万物的中心。相反,这些观测引入了哥白尼的太阳系观:太阳被旋转的行星环绕。在伽利略灵巧的手中,光学透镜(不过是光的一个小把戏)不仅引发了科学革命,也改变了宗教在西方生活中的角色。这不仅仅是现代天文学的诞生,更是现代世界的诞生。
Galileo’s observations served as decisive evidence against the Ptolemaic notion that the Earth stood still at the center of all things. Instead, they ushered in the Copernican view of the solar system: a sun surrounded by spinning planets. In Galileo’s nimble hands, the optic lens—a mere trick of the light—both launched the scientific revolution and transformed the role of religion in Western life. It was more than the birth of modern astronomy. It was the birth of the modern world.
即使在五百年后的今天,显微镜和望远镜仍然与科学进步息息相关。当然,这些设备本身也发生了变化。传统的光学成像技术已经变得更加复杂,而一些当代的显微镜和望远镜所依赖的科学原理也截然不同。例如,扫描隧道显微镜运用了20世纪量子力学的思想。然而,许多科学——在天文学、生物学、化学和物理学等众多领域——的范围仍然主要由其实际应用范围所定义——即利用当时最先进的显微镜和望远镜能够对这些领域进行哪些研究。
Even today, half a millennium later, the microscope and the telescope remain enormously relevant to the progress of science. Of course, the devices themselves have changed. Traditional optical imaging has become much more sophisticated, and some contemporary microscopes and telescopes rely on markedly different scientific principles. For instance, the scanning tunneling microscope uses ideas from twentieth-century quantum mechanics. Nonetheless, the scope of many sciences—in fields as diverse as astronomy, biology, chemistry, and physics—is still defined largely by their actual scopes—by what can be learned about those fields using the very best microscopes and telescopes available.
2005年,我们俩还是研究生的时候,花了很多时间思考科学家可以使用哪些类型的观察镜,以及这些观察镜是如何使科学研究成为可能的。一个看似天马行空的想法深深地吸引了我们。长期以来,我们俩都对历史研究很感兴趣。我们对人类文化如何随着时间推移而变化尤其着迷。有些变化非常剧烈,但更多时候它们十分微妙,仅凭人脑几乎无法察觉。我们想,如果能有一台像显微镜一样的仪器来测量人类文化,识别和追踪那些我们原本永远不会注意到的细微影响,那该有多好?或者,如果能有一台望远镜,让我们能够从遥远的距离之外(在其他大陆、几个世纪以前)进行这样的观察,那该有多好?简而言之,是否有可能创造一种观察镜,它观察的不是物理对象,而是历史变迁?
In 2005, when the two of us were graduate students, we spent a lot of time thinking about the kinds of scopes scientists had access to and the ways in which those scopes made science possible. We became intrigued by what seemed like an off-the-wall idea. For a long time, both of us had been interested in the study of history. We were especially fascinated by how human culture changes over time. Some of these changes are dramatic, but often they are so subtle as to be largely invisible to the unaided brain. Wouldn’t it be great, we thought, if we had something like a microscope to measure human culture, to identify and track all those tiny effects that we would never notice otherwise? Or a telescope that would allow us to do this from a great distance—on other continents, centuries ago? In short, was it possible to create a kind of scope that, instead of observing physical objects, would observe historical change?
当然,这算不上伽利略级别的贡献。现代世界早已存在;太阳已经位于太阳系的中心,等等等等。基本上,每个人都知道望远镜是个好东西。但我们推断,这种新型望远镜或许足够酷,哈佛大学最终可能会让我们毕业,而对于像典型的博士生一样,吃不饱、工资低、受教育过度的人来说,这几乎就是他们唯一的希望了。
Of course, this would not be a Galileo-caliber contribution. The modern world already exists; the sun is already at the center of the solar system, and so on and so forth. Basically, everyone already knows that scopes are a good thing. But, we reasoned, this new kind of scope would probably be cool enough that Harvard might finally let us graduate, which is about all you can hope for when you’re as underfed, underpaid, and overeducated as the typical PhD seeker.
就在我们思考这个略显深奥的问题时,其他地方正在发生一场革命,这场革命将席卷我们,并让数百万人与我们分享这份奇特的魅力。这场大数据革命的核心在于人类如何创造并保存自身活动的历史记录。其后果将改变我们看待自身的方式。它将开启新的视野,使我们的社会能够更有效地探索自身的本质。大数据将改变人文学科,改造社会科学,并重新审视商业世界与象牙塔之间的关系。为了更好地理解这一切是如何发生的,让我们仔细审视历史记录,从它最初的不起眼到如今无处不在。
As we were mulling this somewhat esoteric question, a revolution was occurring elsewhere that would sweep us up in its wake and lead millions of people to share our strange fascination. At its core, this big data revolution is about how humans create and preserve a historical record of their activities. Its consequences will transform how we look at ourselves. It will enable the creation of new scopes that make it possible for our society to more effectively probe its own nature. Big data is going to change the humanities, transform the social sciences, and renegotiate the relationship between the world of commerce and the ivory tower. To better understand how all this came about, let’s take a close look at the historical record, from its modest beginnings to its omnipresent present.
一万年前,史前的牧羊人会周期性地丢失羊群。他们听取了史前失眠症患者的建议,萌生了计数的想法。这些最早的会计人员用石头作为羊计数器,就像赌徒现在使用扑克筹码来记录他们的奖金一样。
Ten thousand years ago, prehistoric shepherds periodically lost their sheep. Taking advice from prehistoric insomniacs, they hit on the idea of counting. Those very first accountants used stones as sheep counters, the same way that gamblers now use poker chips to keep track of their winnings.
这一切运作良好。在接下来的四千年里,随着人们试图追踪种类日益繁多的商品,他们使用一种名为“触针”的简单雕刻工具,在一些石头上刻下图案。这些图案可以用来指示被计数物品的不同类型。最终,在公元前四千年,有人认为追踪大量小石头(石器时代零钱的祖先)不方便。取而代之的是,取一块非常大的石头,用触针在上面并排刻上许多图案,更加方便。文字由此诞生。
All this worked very well. Over the next four thousand years, as people sought to track an increasingly wide array of goods, they used a simple carving instrument called a stylus to engrave patterns on some of the stones. These patterns could be used to indicate the different types of objects being counted. Eventually, in the fourth millennium BCE, someone decided that keeping track of a lot of little rocks—the Stone Age ancestors of loose change—was inconvenient. Instead, it was easier to take one really big stone and use the stylus to engrave lots of patterns on it, side by side. Writing was born.
回想起来,像数羊这样平凡的愿望竟然会成为文字这种基础性进步的推动力,这或许令人感到意外。但对文字记录的渴望始终伴随着经济活动,因为除非你能清楚地追踪谁拥有什么,否则交易毫无意义。正因如此,早期人类的文字充斥着各种交易:各种赌注、筹码和契约。早在先知的著作问世之前,我们就有了利润的著作。事实上,许多文明从未达到记录和留下我们常常与文化史联系在一起的伟大文学作品的阶段。这些古代社会留存下来的大部分只是一堆收据。如果没有制作这些记录的商业企业,我们对这些文化的了解将远远不足。
In retrospect, it might seem surprising that something as mundane as the desire to count sheep was the impetus for an advance as fundamental as written language. But the desire for written records has always accompanied economic activity, since transactions are meaningless unless you can clearly keep track of who owns what. As such, early human writing is dominated by wheeling and dealing: a menagerie of bets, chits, and contracts. Long before we had the writings of the prophets, we had the writings of the profits. In fact, many civilizations never got to the stage of recording and leaving behind the kinds of great literary works that we often associate with the history of culture. What survives these ancient societies is, for the most part, a pile of receipts. If it weren’t for the commercial enterprises that produced those records, we would know far, far less about the cultures that they came from.
这种状况在今天比以往任何时候都更加真实。与前辈不同,如今许多商业企业不再仅仅将创建记录作为经营业务的副产品。像谷歌、Facebook 和亚马逊这样的公司开发的工具,使用户能够在互联网上展现自我,并相互交流。这些工具通过构建数字化的、个人化的历史记录来发挥作用。
This state of affairs is truer today than ever before. Unlike their predecessors, many of today’s commercial enterprises do not create records as a mere by-product of doing business. Companies like Google, Facebook, and Amazon create tools that enable their users to represent themselves, and to interact with one another, on the Internet. These tools work by building a digital, personal, historical record.
对于这样的公司来说,记录人类文化是他们的核心业务。
For such companies, recording human culture is their core business.
而且,它不仅仅是网页、博客和在线新闻等供公众消费的记录。我们的个人交流,无论是通过电子邮件、Skype还是短信,越来越多地在网上进行。很多信息都以某种形式保存在那里,通常由多个实体保存,原则上是永久的。无论是在Twitter还是LinkedIn上,我们的个人和商业关系都在网络上列举,并由网络中介。当我们“点赞”、“推荐”或发送电子贺卡时,我们转瞬即逝的想法和印象会留下永久的数字指纹。即使我们早已忘记收件人的名字,谷歌仍会记住那封愤怒邮件的每一个字。即使我们醒来时头脑昏沉、宿醉未醒,Facebook的照片也会记录下那晚在酒吧的点滴细节。如果我们写了一本书,谷歌会扫描它;如果我们拍了一张照片,Flickr会存储它;如果我们拍了一部电影,YouTube会播放它。
And it’s not just a record of things that were meant for public consumption, like Web pages, blogs, and online news. Increasingly, our personal communication, whether via e-mail, Skype, or text message, happens online. A lot of it is preserved there in some form, often by multiple entities, and in principle forever. Whether on Twitter or LinkedIn, both our personal and business relationships are enumerated on, and mediated by, the Web. When we “plus,” “recommend,” or send an e-card, our fleeting thoughts and impressions leave a permanent digital fingerprint. Google will remember every word of that angry e-mail long after we’ve forgotten the name of the person we sent it to. Facebook’s photos will chronicle the details of that night at the bar even if we woke up with a fuzzy brain and a massive hangover. If we write a book, Google scans it; if we take a photo, Flickr stores it; if we make a movie, YouTube streams it.
当我们体验到当代生活所提供的一切时,当我们的生活越来越多地在互联网上度过时,我们开始留下越来越详尽的数字面包屑痕迹:具有惊人广度和深度的个人历史记录。
As we experience all that contemporary life has to offer, as we live out more and more of our lives on the Internet, we’ve begun to leave an increasingly exhaustive trail of digital bread crumbs: a personal historical record of astonishing breadth and depth.
所有这些加起来有多少信息?
How much information does all this add up to?
在计算机科学中,用来测量信息的单位是位(bit),是“二进制数字”的缩写。你可以把一个位想象成一个“是”或“否”问题的答案,其中1表示“是”,0表示“否”。八个位称为一个字节(byte)。
In computer science, the unit used to measure information is the bit, short for “binary digit.” You can think about a single bit as the answer to a yes-or-no question, where 1 is yes and 0 is no. Eight bits is called a byte.
目前,全球人均数据足迹(即全世界每年人均产生的数据量)略低于 1 TB。这相当于约 8 万亿个是非题。这意味着,人类整体上每年产生 5 ZB 的数据:40,000,000,000,000,000,000,000(四百万亿亿)比特。
Right now, the average person’s data footprint—the annual amount of data produced worldwide, per capita—is just a little short of one terabyte. That’s equivalent to about eight trillion yes-or-no questions. As a collective, that means humanity produces five zettabytes of data every year: 40,000,000,000,000,000,000,000 (forty sextillion) bits.
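The unit arithmetic in this paragraph can be checked with a short Python sketch. It assumes the decimal (SI) meaning of the prefixes, so one terabyte is 10^12 bytes and one zettabyte is 10^21 bytes; the constant and function names are illustrative only.

```python
BITS_PER_BYTE = 8
TERABYTE = 10**12   # bytes, assuming decimal (SI) prefixes
ZETTABYTE = 10**21  # bytes, assuming decimal (SI) prefixes


def bytes_to_bits(n_bytes: int) -> int:
    """Convert a byte count into the equivalent number of yes-or-no answers (bits)."""
    return n_bytes * BITS_PER_BYTE


# One terabyte expressed in bits: eight trillion.
bits_in_one_terabyte = bytes_to_bits(1 * TERABYTE)

# Five zettabytes expressed in bits: forty sextillion.
bits_in_five_zettabytes = bytes_to_bits(5 * ZETTABYTE)

# If every person produced a full terabyte, five zettabytes would correspond
# to five billion people. A footprint somewhat below one terabyte per person
# therefore implies a population somewhat above that figure.
people_at_one_terabyte_each = (5 * ZETTABYTE) // TERABYTE

if __name__ == "__main__":
    print(f"{bits_in_one_terabyte:,}")
    print(f"{bits_in_five_zettabytes:,}")
    print(f"{people_at_one_terabyte_each:,}")
```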
如此大的数字难以理解,所以让我们试着把事情说得更具体一些。如果你手写出 1 兆字节 (MB) 所包含的信息,那么这一行 1 和 0 的长度将超过珠穆朗玛峰高度的五倍。如果你手写出 1 千兆字节 (GB),它可以沿赤道绕地球一圈。如果你手写出 1 太字节 (TB),它可以在地球与土星之间往返二十五次。如果你手写出 1 拍字节 (PB),你可以往返一趟旅行者 1 号探测器,它是宇宙中最遥远的人造物体。如果你手写出 1 艾字节 (EB),你可以到达半人马座α星。如果你手写出人类每年产生的全部 5 泽字节 (ZB),你可以到达银河系的核心。如果你不用这 5 泽字节来发送电子邮件和播放流媒体电影,而是像古代牧羊人那样用它来数羊,那么你可以轻松地数清一群填满整个宇宙、不留任何空隙的羊。
Such large numbers are hard to fathom, so let’s try to make things a bit more concrete. If you wrote out the information contained in one megabyte by hand, the resulting line of 1s and 0s would be more than five times as tall as Mount Everest. If you wrote out one gigabyte by hand, it would circumnavigate the globe at the equator. If you wrote out one terabyte by hand, it would extend to Saturn and back twenty-five times. If you wrote out one petabyte by hand, you could make a round trip to the Voyager 1 probe, the most distant man-made object in the universe. If you wrote out one exabyte by hand, you would reach the star Alpha Centauri. If you wrote out all five zettabytes that humans produce each year by hand, you would reach the galactic core of the Milky Way. If instead of sending e-mails and streaming movies, you used your five zettabytes as an ancient shepherd might have—to count sheep—you could easily count a flock that filled the entire universe, leaving no empty space at all.
这就是为什么人们把这类记录称为大数据。而今天的大数据只是冰山一角。随着数据存储技术的进步、带宽的增加以及我们的生活逐渐迁移到互联网上,智人的数据总量每两年就会翻一番。大数据只会越来越大。
This is why people call these sorts of records big data. And today’s big data is just the tip of the iceberg. The total data footprint of Homo sapiens is doubling every two years, as data storage technology improves, bandwidth increases, and our lives gradually migrate onto the Internet. Big data just gets bigger and bigger and bigger.
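To get a feel for what doubling every two years implies, here is a minimal sketch of the compounding. It assumes, purely for illustration, that the doubling period stays constant, which the text does not claim; the function name and the projection horizons are hypothetical.

```python
def projected_footprint(initial_zb: float, years: float,
                        doubling_period_years: float = 2.0) -> float:
    """Annual data footprint after `years`, if it doubles every `doubling_period_years`."""
    if doubling_period_years <= 0:
        raise ValueError("doubling_period_years must be positive")
    return initial_zb * 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    # Starting from five zettabytes per year, as in the text.
    for horizon in (0, 2, 10, 20):
        print(horizon, projected_footprint(5, horizon))
```

Under that assumption, ten doublings over twenty years multiply the footprint by 1,024.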
可以说,当今的文化记录与过去的文化记录之间最关键的区别在于,今天的大数据以数字形式存在。如同光学镜头能够可靠地转换和操控光线一样,数字媒体也能够可靠地转换和操控信息。有了足够的数字记录和足够的计算能力,人类文化的新视角就成为可能,它有可能对我们理解世界以及自身在其中的位置做出令人敬畏的贡献。
Arguably the most crucial difference between the cultural records of today and those of years gone by is that today’s big data exists in digital form. Like an optic lens, which makes it possible to reliably transform and manipulate light, digital media make it possible to reliably transform and manipulate information. Given enough digital records and enough computing power, a new vantage point on human culture becomes possible, one that has the potential to make awe-inspiring contributions to how we understand the world and our place in it.
考虑以下问题:如果你想了解当代人类社会,哪个对你更有帮助——不受限制地进入一所顶尖大学的社会学系,那里挤满了研究社会运作的专家,还是不受限制地访问 Facebook,一家旨在帮助调解人类在线社会关系的公司?
Consider the following question: Which would help you more if your quest was to learn about contemporary human society—unfettered access to a leading university’s department of sociology, packed with experts on how societies function, or unfettered access to Facebook, a company whose goal is to help mediate human social relationships online?
一方面,社会学系的教员们受益于从许多人毕生致力于学习和研究的成果中提炼出的精辟见解。另一方面,Facebook 是十亿人日常社交生活的一部分。它知道他们在哪里生活和工作,在哪里玩耍、和谁一起玩,他们喜欢什么,什么时候生病,以及他们和朋友聊什么。所以我们问题的答案很可能就是 Facebook。即便现在还不是,那么二十年后呢?到那时,Facebook 或其他类似的网站存储的信息将是现在的一万倍,并且涵盖地球上的每一个人。
On the one hand, the members of the sociology faculty benefit from brilliant insights culled from many lifetimes dedicated to learning and study. On the other hand, Facebook is part of the day-to-day social lives of a billion people. It knows where they live and work, where they play and with whom, what they like, when they get sick, and what they talk about with their friends. So the answer to our question may very well be Facebook. And if it isn’t—yet—then what about a world twenty years down the line, when Facebook or some other site like it stores ten thousand times as much information, about every single person on the planet?
这些思考开始促使科学家甚至人文学者做一些不寻常的事情:走出象牙塔,与大公司展开合作。尽管观点和灵感截然不同,这些看似不搭的伙伴却在开展前辈们难以想象的研究,所使用的数据集规模之大在人类学术史上前所未有。
These kinds of ruminations are starting to cause scientists and even scholars of the humanities to do something unfamiliar: to step out of the ivory tower and strike up collaborations with major companies. Despite their radical differences in outlook and inspiration, these strange bedfellows are conducting the types of studies that their predecessors could hardly have imagined, using datasets whose sheer magnitude has no precedent in the history of human scholarship.
斯坦福大学经济学家乔恩·莱文与eBay合作,研究现实世界市场中价格的形成机制。莱文利用了eBay卖家经常进行微型实验来确定商品定价这一事实。通过同时研究数十万个此类定价实验,莱文及其同事对价格理论有了深刻的理解。价格理论是经济学中一个发展成熟但主要理论化的分支学科。莱文指出,现有文献通常是正确的,但有时也会犯重大错误。他的研究成果影响深远,甚至帮助他获得了约翰·贝茨·克拉克奖章——这是授予40岁以下经济学家的最高奖项,通常预示着诺贝尔奖的到来。
Jon Levin, an economist at Stanford, teamed up with eBay to examine how prices are established in real-world markets. Levin exploited the fact that eBay vendors often perform miniature experiments in order to decide what to charge for their goods. By studying hundreds of thousands of such pricing experiments at once, Levin and his co-workers shed a great deal of light on the theory of prices, a well-developed but largely theoretical subfield of economics. Levin showed that the existing literature was often right—but that it sometimes made significant errors. His work was extremely influential. It even helped him win a John Bates Clark Medal—the highest award given to an economist under forty and one that often presages the Nobel Prize.
由加州大学圣地亚哥分校的詹姆斯·福勒领导的研究小组与 Facebook 合作,对六千一百万 Facebook 用户进行了一项实验。实验表明,如果一个人得知自己的密友已登记投票,他登记投票的可能性会大大增加。朋友越亲近,影响力就越大。这项实验登上了著名科学期刊《自然》的封面,除了带来令人瞩目的结果之外,还使2010年的投票人数增加了三十多万。这些选票足以左右一场选举。
A research group led by UC San Diego’s James Fowler partnered with Facebook to perform an experiment on sixty-one million Facebook members. The experiment showed that a person was much more likely to register to vote after being informed that a close friend had registered. The closer the friend, the greater the influence. Aside from its fascinating results, this experiment—which was featured on the cover of the prestigious scientific journal Nature—ended up increasing voter turnout in 2010 by more than three hundred thousand people. That’s enough votes to swing an election.
东北大学的物理学家阿尔伯特-拉斯洛·巴拉巴西 (Albert-László Barabási) 与几家大型电话公司合作,通过分析手机留下的数字轨迹来追踪数百万人的移动。最终,他们以整个城市的规模,对普通人的日常移动进行了新颖的数学分析。巴拉巴西和他的团队非常擅长分析移动历史,有时甚至可以预测某人接下来要去哪里。
Albert-László Barabási, a physicist at Northeastern, worked with several large phone companies to track the movements of millions of people by analyzing the digital trail left behind by their cell phones. The result was a novel mathematical analysis of ordinary human movement, executed at the scale of whole cities. Barabási and his team got so good at analyzing movement histories that, occasionally, they could even predict where someone was going to go next.
在谷歌内部,由软件工程师杰里米·金斯伯格领导的团队观察到,在流感疫情期间,人们搜索流感症状、并发症和治疗方法的可能性要大得多。他们利用这一并不令人意外的事实,做了一件意义深远的事情:创建一个系统,实时查看特定地区人们的谷歌搜索内容,并识别新出现的流感疫情。他们的预警系统能够比美国疾病控制与预防中心 (CDC) 更快地识别新发疫情,尽管 CDC 正是为此目的维护着庞大而昂贵的基础设施。
Inside Google, a team led by software engineer Jeremy Ginsberg observed that people are much more likely to search for influenza symptoms, complications, and remedies during an epidemic. They made use of this rather unsurprising fact to do something deeply important: to create a system that looks at what people in a particular region are Googling, in real time, and identifies emerging flu epidemics. Their early warning system was able to identify new epidemics much faster than the U.S. Centers for Disease Control could, despite the fact that the CDC maintains a vast and costly infrastructure for exactly this purpose.
哈佛大学经济学家拉吉·切蒂联系了美国国税局 (IRS)。他说服IRS分享了数百万在特定城区上学的学生的信息。他和他的同事将这些信息与学区本身的第二个数据库(记录课堂作业)结合起来。这样,切蒂的团队就知道了哪些学生跟随哪些老师学习。综合所有这些信息,该团队开展了一系列令人瞩目的研究,探讨优秀教师的长期影响,并提出了一系列其他政策干预措施。他们发现,优秀的教师对学生上大学的可能性、毕业后多年的收入,甚至对学生日后在好社区安家落户的可能性都有显著的影响。该团队随后利用这些研究结果来帮助改进教师效能的衡量标准。2013年,切蒂也获得了约翰·贝茨·克拉克奖章。
Raj Chetty, an economist at Harvard, reached out to the Internal Revenue Service. He persuaded the IRS to share information about millions of students who had gone to school in a particular urban district. He and his collaborators then combined this information with a second database, from the school district itself, which recorded classroom assignments. Thus, Chetty’s team knew which students had studied with which teachers. Putting it all together, the team was able to execute a breathtaking series of studies on the long-term impact of having a good teacher, as well as a range of other policy interventions. They found that a good teacher can have a discernible influence on students’ likelihood of going to college, on their income for many years after graduation, and even on their likelihood of ending up in a good neighborhood later in life. The team then used its findings to help improve measures of teacher effectiveness. In 2013, Chetty, too, won the John Bates Clark Medal.
而在煽动性的 FiveThirtyEight 博客上,一位名叫内特·西尔弗 (Nate Silver) 的前棒球分析师一直在探索能否利用大数据方法预测全国选举的获胜者。西尔弗收集了大量总统选举民意调查数据,这些民意调查来自盖洛普、拉斯穆森、兰德公司、梅尔曼、CNN 等众多机构。利用这些数据,他正确预测了奥巴马将赢得 2008 年大选,并准确预测了 49 个州和哥伦比亚特区的选举人团获胜者。他唯一预测错误的州是印第安纳州。这并没有留下太多改进的空间,但到了下一次,他确实有所改进。2012 年选举日上午,西尔弗宣布奥巴马击败罗姆尼的几率为 90.9%,并准确预测了哥伦比亚特区和所有州(包括印第安纳州)的获胜者。
And over at the incendiary FiveThirtyEight blog, a former baseball analyst named Nate Silver has been exploring whether a big data approach might be used to predict the winners of national elections. Silver collected data from a vast number of presidential polls, drawn from Gallup, Rasmussen, RAND, Mellman, CNN, and many others. Using this data, he correctly predicted that Obama would win the 2008 election, and accurately forecast the winner of the Electoral College in forty-nine states and the District of Columbia. The only state he got wrong was Indiana. That doesn’t leave much room for improvement, but the next time around, improve he did. On the morning of Election Day 2012, Silver announced that Obama had a 90.9 percent chance of beating Romney, and correctly predicted the winner of the District of Columbia and of every single state—Indiana, too.
The list goes on and on. Using big data, the researchers of today are doing experiments that their forebears could not have dreamed of.
This book is the story of one of those experiments.
The object of our experiment was not a person or a frog or a molecule or an atom. Instead, the object of our experiment was one of the most fascinating datasets in the history of history: a digital library whose stated goal is to encompass every book ever written.
Where did this remarkable library come from?
In 1996, two Stanford computer science graduate students were working on a now-defunct effort known as the Stanford Digital Library Technologies Project. The goal was to envision the library of the future, a library that would integrate the world of books with the World Wide Web. They worked on a tool for enabling users to navigate through library collections, jumping from book to book in cyberspace. But this was not something that could be implemented in practice at the time, because relatively few books were available in digital form. So the pair took their ideas and techniques for navigating from one text to another, followed the big data trail to the World Wide Web, and turned their work into a little search engine. They called it Google.
By 2004, Google’s self-appointed mission to “organize the world’s information” was going pretty well, leaving founder Larry Page with some free time to get back to his first love, libraries. Frustratingly, it was still the case that only a few books were available in digital form. But something had changed in the intervening years: Page was now a billionaire. So he decided that Google would get into the business of scanning and digitizing books. And while his company was at it, Page thought, Google might as well do all of them.
Ambitious? No doubt. But Google has been pulling it off. Nine years after publicly announcing the project, Google has digitized more than 30 million books. That’s about one in every four books ever published. Its collection is bigger than that of Harvard (17 million volumes), Stanford (9 million), Oxford’s Bodleian (11 million), or any other university library. It has more books than the National Library of Russia (15 million), the National Library of China (26 million), and the Deutsche Nationalbibliothek (25 million). As of this writing, the only library with more books is the U.S. Library of Congress (33 million). By the time you read this sentence, Google may have passed them, too.
When the Google Books project was getting started, we, along with everyone else, read about it in the news. But it wasn’t until two years later, in 2006, that the impact of Google’s undertaking really sank in. At the time, we were finalizing a paper on the history of English grammar. For our paper, we had manually done some small-scale digitization of Old English grammar textbooks.
The books most relevant to our research were buried in the bowels of Harvard’s Widener Library. Here’s how to find them. First, go to floor 2 of the East Wing. Walk past the Roosevelt Collection and the Amerindian languages section; you’ll see an aisle with call numbers 8900 and up. Our books were on the second shelf from the top. For years, as our research progressed, we made frequent trips to this shelf. We were the only people who had taken those books out in years, and sometimes in decades. No one cared much about our shelf but us.
One day, we realized that a book we had been using regularly for our study was now available on the Web, as part of the Google Books project. Curious, we started searching for other books on our shelf. They were there too. Not because the Google corporation cared about English grammar in the Middle Ages. Nearly every book that we checked, no matter what shelf it was on, now had a digital counterpart. In the time that it took us to examine a handful of books, Google had digitized a handful of buildings.
Google’s books-by-the-building represented a completely new type of big data, and it had the potential to transform the way that people look at the past. Most big data is big but short: recent records produced from recent events. This is because the creation of the underlying data was catalyzed by the Internet, a relatively recent innovation. Our goal was to study the kinds of cultural changes that can span long time periods, as generation after generation of people lives and dies. When it comes to exploring changes on historical time scales, short data, no matter how big, isn’t very useful.
Google Books is as big a dataset as almost any in our age of digital media. But much of what Google is digitizing isn’t contemporary: Unlike e-mails, RSS feeds, and superpokes, the book record goes back for centuries. So Google Books isn’t just big data, it’s long data.
Since they contain such long data, digitized books aren’t limited to painting a picture of contemporary humanity, as most big datasets are. Books can also offer a portrait of how our civilization has changed over fairly long periods of time—longer than the length of a human life, longer even than the lifetimes of whole nations.
Books are a fascinating dataset for other reasons, too. They cover an extraordinary range of topics and reflect a wide range of perspectives. Exploring a large collection of books can be thought of as surveying a large number of people, many of whom happen to be dead. In the fields of history and literature, the books of a particular time and place are among the most important sources of information about that time and that place.
This suggested to us that, by examining Google’s books through a digital lens, it would be possible to build a scope to study human history. No matter how long it took us, we knew we had to get our hands on that data.
Big data creates new opportunities to understand the world around us, but it also creates new scientific challenges.
One major challenge is that big data is structured very differently from the kinds of data that scientists typically encounter. Scientists prefer to answer carefully constructed questions using elegant experiments that produce consistently accurate results. But big data is messy data. The typical big dataset is a miscellany of facts and measurements, collected for no scientific purpose, using an ad hoc procedure. It is riddled with errors, and marred by numerous, frustrating gaps: missing pieces of information that any reasonable scientist would want to know. These errors and omissions are often inconsistent, even within what is thought of as a single dataset. That’s because big datasets are frequently created by aggregating a vast number of smaller datasets. Invariably, some of these component datasets are more reliable than others, and each one is subject to its own idiosyncrasies. Facebook’s social network is a good example. Friending someone means different things in different parts of the Facebook network. Some people friend liberally. Others are much cagier. Some friend co-workers, but others don’t. Part of the job of working with big data is to come to know your data so intimately that you can reverse engineer these quirks. But how intimate can you possibly be with a petabyte?
A second major challenge is that big data doesn’t fit too well into what we typically think of as the scientific method. Scientists like to confirm specific hypotheses, and to gradually assemble what they’ve learned into causal stories and eventually mathematical theories. Blunder about in any reasonably interesting big dataset and you will inevitably make discoveries—say, a correlation between rates of high-seas piracy and atmospheric temperature. This kind of exploratory research is sometimes called “hypothesis free,” since you never know, going in, what you’ll find. But big data is much less incisive when it comes time to explain these correlations in terms of cause and effect. Do pirates bring about global warming? Does hot weather make more people take up high-seas piracy? And if the two are unrelated, then why are they both increasing in recent years? Big data often leaves us guessing.
As we continue to stockpile unexplained and underexplained patterns, some have argued that correlation is threatening to unseat causation as the bedrock of scientific storytelling. Or even that the emergence of big data will lead to the end of theory. But that view is a little hard to swallow. Among the greatest triumphs of modern science are theories, like Einstein’s general relativity or Darwin’s evolution by natural selection, that explain the cause of a complex phenomenon in terms of a small set of first principles. If we stop striving for such theories, we risk losing sight of what science has always been about. What does it mean when we can make millions of discoveries, but can’t explain a single one? It doesn’t mean that we should give up on explaining things. It just means that we have our work cut out for us.
A final major challenge is the change in where the data lives. As scientists, we are used to getting data by experimenting in our laboratories or going out into the natural world to write down our observations. Getting data is, to some extent, within the scientist’s control. But in the world of big data, major corporations, and even governments, are often the gatekeepers of the most powerful datasets. And they, their citizens, and their customers care a great deal about how the data is used. Very few people want the IRS to share their tax returns with budding scholars, however well-intentioned those scholars might be. Vendors on eBay don’t want a complete record of their transactions to become public information or to be made available to random grad students. Search engine logs and e-mails are entitled to privacy and confidentiality. Authors of books and blogs are protected by copyright. And companies have strong proprietary interests in the data they control. They may analyze their data with a view toward generating more ad revenue, but they are loath to share the heart of their competitive advantage with outsiders, and especially scholars and scientists who are unlikely to contribute to their bottom line.
For all these reasons, some of the most powerful resources in the history of human self-knowledge are going largely unused. Despite the fact that the study of social networks is many decades old, almost no public work has been done on the full social network of Facebook, because the company has little incentive to share it. Despite the fact that the theory of economic markets is centuries old, the detailed transactions of most major online markets remain largely inaccessible to economists. (Levin’s eBay study was the exception, not the rule.) And despite the fact that humans have spent millennia striving to map the world, the images produced by companies like DigitalGlobe, which has created fifty-centimeter-resolution satellite images of the entire surface of the Earth, have never been systematically explored. When you think about it, these gaps in our usually insatiable human desire to learn and explore are shocking. This would be as if astronomers spent many lifetimes trying to study the distant stars, but for legal reasons were never permitted to gaze at the sun.
Still, just knowing that the sun is there can make the desire to stare at it irresistible. And so today, all over the world, a strange mating dance is taking place. Scholars and scientists approach engineers, product managers, and even high-level executives about getting access to their companies’ data. Sometimes the initial conversation goes well. They go out for coffee. One thing leads to another, and a year later, a brand-new person enters the picture. Unfortunately, this person is usually a lawyer.
As we worked to analyze Google’s library of everything, we had to find ways to deal with each of these challenges. Because the obstacles posed by digital books are not unique; they are merely a microcosm of the state of big data today.
This book is about our seven-year effort to quantify historical change. The result is a new kind of scope and a strange, fascinating, and addictive approach to language, culture, and history that we call culturomics.
We’ll describe all sorts of observations that can be made using a culturomic approach. We’ll talk about what our ngram data has revealed about how English grammar changes, how dictionaries make mistakes, how people get famous, how governments suppress ideas, how societies learn and forget, and how—in little ways—our culture can appear to behave deterministically, making it possible to predict aspects of our collective future.
And of course, we’ll introduce you to our new scope: a tool we created with Google, called—for reasons that will become apparent in chapter 3—the Ngram Viewer. Released in 2010, the Ngram Viewer charts the frequency of words and ideas over time. This scope—and the massive computation that led to its creation—is the robot historian of our opening vignette. You can try it yourself, right now, at http://books.google.com/ngrams. Ours is a hardworking robot, used by millions of people, of all ages, all over the world, at all hours of day or night, all hoping to understand history in a new way: by charting the uncharted.
In short, this book is about history as it is told by the robots, about what the human past looks like when viewed through a digital lens. And though today the Ngram Viewer might be seen as odd or exceptional, the digital lens is flourishing, much in the same way that the optical lens did centuries ago. Powered by our burgeoning digital footprint, new scopes are popping up every day, exposing once-hidden aspects of history, geography, epidemiology, sociology, linguistics, anthropology, and even biology and physics. The world is changing. The way we look at the world is changing. And the way we look at those changes . . . well, that’s changing too.
In 1911, the American newspaper editor Arthur Brisbane famously told a group of marketers that a picture is “worth a thousand words.” Or he famously proposed that it’s worth “ten thousand words.” Or was it “a million words”? In any case, within decades, the expression had swept the country and—probably to Brisbane’s chagrin—was now being billed as a Japanese proverb. (His listeners were in marketing, after all.)
What did Brisbane actually say? Alas, our new scope isn’t likely to record the first instance of this expression. There’s a Japanese proverb for that, too:
Compared to all speech,
Grasshopper, Google’s scanned books
are but a haiku
Still, the scope can help us see how Brisbane’s principle of iconic economics took shape.
It turns out that the thousand words, ten thousand words, and million words variants emerged shortly after Brisbane’s (possibly) fateful remarks. All three forms competed for the next two decades. Ten thousand jumped to an early lead. But then came the ’30s: Did ten thousand and million seem exorbitant to Depression-era ears? Whatever the cause, those years saw “a picture is worth a thousand words” begin the slow ascent that left its competition in the dust.
G. K. ZIPF AND THE FOSSIL HUNTERS
beautiful beautiful beautiful beautiful beautiful beautiful beautiful beautiful beautiful beautiful beautiful beautiful beautiful beautiful beautiful beautiful beautiful, beautiful, beautiful, beautiful, beautiful, beautiful, beautiful, beautiful,” beautiful. beautiful. beautiful.” beautiful . . . beautiful . . .
—Legendary, Lexical, Loquacious Love
In 1996, the concept artist Karen Reimer published the book Legendary, Lexical, Loquacious Love. Here is how she wrote it: She took the full text of a romance novel and alphabetized it. If a word appeared multiple times in the novel, it appears multiple times in her book.
The book has no syntax and no sentences. It is a 345-page-long list of words in alphabetical order. It does not look or read like a novel. In fact, when you read it, it appears to be complete nonsense.
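Reimer's procedure is simple enough to sketch in a few lines of Python. The version below is only an illustration, not a reconstruction of how she actually produced the book: the function name `alphabetize` and the sample sentence are made up, and it assumes that words are split on whitespace and sorted without regard to capitalization, with any punctuation left attached to its word.

```python
def alphabetize(text: str) -> list[str]:
    """Return every whitespace-separated token of `text` in alphabetical order.

    Repeated words are kept, so a word used n times appears n times.
    """
    tokens = text.split()
    # Compare lowercase forms first; fall back to the raw token so that
    # the order of "Her" versus "her" is reproducible.
    return sorted(tokens, key=lambda token: (token.lower(), token))


# Hypothetical sample sentence, invented for this example.
sample = "Her eyes met his eyes and her heart raced"
print(" ".join(alphabetize(sample)))
# prints: and eyes eyes heart Her her his met raced
```

Because repeated words are kept, the length of each run of identical words is exactly that word's frequency in the original text.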
We rarely read romance novels, but Reimer’s work is an exception. An absolute page-turner, it fascinated us from cover to cover, from the dramatic beginning:
A
A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A A
And all the way through the surprising finish:
Chapter Twenty-Five
Z
zealous
Twenty-five chapters, not twenty-six: There is no chapter for X, as the novel contained no words beginning with the letter X. Romance novels may be XXX-rated, but they contain very few actual X-words.
Even though it’s just a single book, Legendary, Lexical, Loquacious Love gives really suggestive insights into the entire romance genre. For instance, it’s clear that this is a book for her—the word her occupies almost a full eight pages (130–38). His? Two and a half (141–44). There’s half a page of eyes and a third of a page of breasts, but only a single line about buttocks. Occasionally, the book is downright racy—there are three climaxes on page 62 alone. You go, girl! (Or guy; there’s no way for us to know.)
Sometimes the book dwells too long on the superficial. For instance, beautiful appears twenty-nine times. Intelligent? Only once. But at other times, one gets a whiff of the original book’s plot, such as a bone-chilling passage on page 187: “Murderers murderers, murdering murdering murdering murdering murdering murdering murdering, murderous murderous. murders murders, murky murmur murmured.”
Over the years, we’ve turned to this book again and again, finding interesting new nuggets each time.
This is a bit odd. You would think that by alphabetizing a romance novel, and thereby obliterating its meaning, Reimer would also eliminate everything that made the novel interesting. And that’s true, to an extent. But in the process, Reimer’s alphabetical transmutation reveals a world that was once invisible: word frequencies, the lexical atoms from which the novel was composed. Those frequencies—and the stories they tell—are what make her work such an engaging read.
When we met in 2005, big data was not yet a thing. The thought of reading millions of books in a split second hadn’t entered our minds. We were just young graduate students looking to ply our trade on the most interesting questions we could find.
To find a fascinating question, it’s helpful to have a fascinating environment. We met at Harvard’s Program for Evolutionary Dynamics, a haven of creativity and science founded by the charismatic mathematician and biologist Martin Nowak. The PED (Program for Evolutionary Dynamics? Program for rEvolutionary Dynamics? Party Every Day?) is a place where mathematicians, linguists, cancer researchers, religious scholars, psychologists, and physicists congregate, thinking about new ways to look at the world. Nowak encouraged us to tackle the problems we found most interesting, regardless of where they might be found.
What makes a problem fascinating? No one really agrees. It seemed to us that a fascinating question was something that a young child might ask, that no one knew how to answer, and for which a few person-years of scientific exploration—the kind of effort we could muster ourselves—might result in meaningful progress. Children are a great source of ideas for scientists, because the questions they ask, though superficially simple and easy to understand, are so often profound. Questions like “Where does the sun go at night?” and “Why is the sky blue?” naturally lead the curious mind right into the heart of astronomy and physics. Questions like “Could a tree ever grow to be as tall as a mountain?” or “If we were really, really careful to avoid accidents, would we live forever?” turn on some of the most urgent issues in modern biology. “Why do I have to go to sleep?”—a tired cliché—still keeps neuroscientists up at night.
But of all these questions, one in particular caught our eye. “Why do we say drove and not drived?”
This question intrigued us because it was a simple example of a very profound concern about mankind. Why, as a culture, do we use certain words and not others? Why do we have certain ideas and not others? Why do we obey certain rules and not others?
Faced with a question like this, there are two possible approaches. One is to focus on the present circumstances that lead to a certain thing being a certain way. For instance: “Beloved child, you say drove because everyone else says drove and because, if you were to say drived, the neighbors would think that we, your parents, didn’t bother to teach you proper English.” This is a fine answer, which raises complex issues about the nature of social norms, issues that philosophers have been grappling with for centuries. But sometimes it can be more illuminating for a scientist to take the long view.
Surely the most impressive example of the long view in the history of the sciences lies in the work of Charles Darwin. More than 150 years ago, Darwin took a boat trip and encountered all sorts of creatures. He began to wonder about some birds that he saw in the Galápagos: Why are the beaks of those finches the way they are? More generally, why are all organisms the way they are?
What Darwin did next was extremely insightful. Instead of focusing entirely on the present, he took the long view. Darwin asked himself, How did things come to be this way over time? If we want to understand the world as it is now, Darwin reasoned, we must understand the process of change that brought about our present conditions. That process of change—Darwin’s seminal discovery—is the combination of reproduction, mutation, and natural selection that together explain the remarkable diversity of the living world. In other words, the theory of evolution.
If you take the long view, the question of why we say drove and not drived becomes a scientific quest for the forces that shape the evolution of human culture. For a long time, we had no idea how to even begin to uncover those forces. All we had was a childlike question.
As scientists, we need to be able to collect data: cold, hard facts and precise measurements. We need to be able to frame unambiguous hypotheses, and then try to falsify them using definitive experiments and decisive analyses. From that standpoint, culture—hard to define, harder still to measure—can be a tough nut to crack. This is what makes scientific work in fields like anthropology such an immense challenge, and is part of why, in 2010, the American Anthropological Association made the controversial decision to remove the word science from its statement of purpose. (The word has since been restored.)
We decided to start with a narrow aspect of culture that is much easier to define and to measure: language. Language is a great microcosm of the study of culture as a whole. It is the primary vehicle by which human culture is communicated. It changes, as is apparent to anyone who’s ever wound up in the audience of one of Shakespeare’s plays. Finally, language is often written down, and in that form furnishes a convenient dataset for scientific analysis. After all, written language is one of the earliest ancestors of big data.
So how should we go about exploring the evolution of language? In biology, there is no better way to understand the broad patterns of evolution than by looking at fossils. But finding fossils is hard. It requires a combination of careful planning and good strategy. If we hope to make progress finding fossils, we would do well to learn from Nathan Myhrvold, perhaps the greatest dinosaur hunter of his generation. (A man of many talents, he also founded Microsoft Research and wrote the book on modernist cuisine.) It’s not that Myhrvold is luckier than everyone else, and that every whitish rock he haplessly blunders across turns out to be a T. rex skull. Myhrvold and his team use detailed geological maps, satellite images, and their own painstaking analysis of T. rex ecology to decide where to explore, where the whitish rocks are likeliest to be fossils. As a result, they’ve found nine T. rex skeletons since 1999—when only eighteen such skeletons had been found in the ninety previous years. As Myhrvold puts it, “We have dominant T. rex market share.”
Our ambition was to get dominant language-fossil market share. Just as dinosaur fossils tell us about biological evolution, linguistic fossils would help us understand how language evolves. But if we wanted to have a good chance of finding such fossils, we needed some kind of guiding principle to help us figure out where to dig. As it turns out, just such a compass had been created eighty years ago, by a man who, like us, really liked to count.
乔治·金斯利·齐普夫(George Kingsley Zipf)于20世纪30至40年代在哈佛大学任教,担任德国文学系主任。他拥有相当罕见的综合技能:他是一位杰出的人文学者,同时又非常偏重量化研究。
George Kingsley Zipf was at Harvard in the 1930s and 1940s, and was chair of the German literature department. He had a mix of skills that is rather rare: a prominent humanist, but with a very quantitative bent.
齐普夫是一位文人,他花了大量时间思考词语。他清楚地意识到,并非所有词语都生来平等。“ the” (那)这个词我们经常使用,但我们很少听到“quiescence”(静止)这个词。齐普夫对这种不平衡感到困惑,并想弄清楚究竟是怎么回事。
Being a man of letters, Zipf spent a lot of time thinking about words. It was rather obvious to Zipf that all words are not created equal. The word the is used all the time, but we rarely hear the word quiescence. Zipf found this imbalance puzzling and wanted to understand what was going on.
思考齐普夫问题的一种方式是:想象英语是一个国家,每个单词都是一位公民。再想象每个“单词人”的身高与该词的使用频率成正比:the 会是一个巨人,而 quiescence 则非常矮小。生活在这些体型奇特的人中间会是什么感觉?齐普夫觉得这种孩子气的问题很吸引人。
One way to think about Zipf’s question is as follows: Imagine that the English language were a nation, and each word a citizen. And imagine that the height of each word-person were proportional to the frequency of that word’s use—the would be a giant word, but quiescence would be tiny. What would it be like to live among such oddly sized people? That’s the kind of childlike question Zipf found fascinating.
为了描绘这个世界的样子,齐普夫需要统计所有单词,并计算每个单词的使用次数。如今,这种事情在计算上很简单(一行命令即可)。这就是为什么概念艺术书《传奇、词汇、饶舌的爱情》没有花费数十年时间创作的原因。但在1937年,没有什么事情在计算上是简单的。现代计算机还不存在。“计算机”一词指的是从事算术计算的研究人员。
To picture what this world looks like, Zipf needed to take a census of all the words and count how many times each one was used. Today, this kind of thing is computationally trivial (a one-line command). That’s why the concept art book Legendary, Lexical, Loquacious Love didn’t take decades to write. But back in 1937, nothing was computationally trivial. Modern computers didn’t exist. The word computer meant a researcher whose job was to perform arithmetic calculations.
如果要统计单词数量,齐普夫就得用老办法,把每个单词一个一个地手工记录下来。当然,那样会无聊透顶。
If he was going to count words, Zipf would have to do it the old-fashioned way, by recording every instance of every word, one by one, by hand. Of course, that would be soul-crushingly boring.
当他偶然发现迈尔斯·L·汉利(Miles L. Hanley)的著作时,一定欣喜若狂。汉利是《尤利西斯》的忠实粉丝,他出版了一部煞费苦心、堪称壮举的著作,却给它取了一个相当乏味的书名:《詹姆斯·乔伊斯〈尤利西斯〉词汇索引》。这本书属于一种被称为“索引”(concordance)的学术著作,旨在帮助《尤利西斯》的学者和爱好者找到书中任何单词的每一处出现。对齐普夫来说,没有哪本书比这本书更令人兴奋了。为了解决他最初的问题,齐普夫只需要拿出汉利的索引,数一数每个条目有多长。这要容易得多。
He must have been pretty ecstatic when he came across the work of Miles L. Hanley. Hanley, who was a big fan of Ulysses, had published a painstaking and heroic work that he had given the rather boring title Word Index to James Joyce’s Ulysses. This book, a type of scholarly work known as a concordance, was meant to allow fellow Ulysses scholars and enthusiasts to find every instance of any word in the book. To Zipf, no book could have been more exciting. In order to get at his original problem, all Zipf had to do was take Hanley’s index and count how long each of the entries was. Much, much easier.
值得注意的是,齐普夫远远领先于他的时代,理解了当今科学家和人文学者才刚刚开始学习的东西:如何追踪数据。齐普夫巧妙地根据他掌握的数据类型重新构建了他所关注的问题。他没有去解决统计所有单词这个不可能的问题,而是选择了《尤利西斯》中易于处理的问题——统计单词。如果他今天还活着,谷歌宣布图书数字化项目的那一刻,他一定会冲到公司门口。
Note that Zipf understood, well ahead of his time, what scientists and humanists today are just beginning to learn: how to follow the data. Zipf skillfully reframed the questions he cared about in light of the kind of data available to him. Instead of tackling the impossible problem of counting all words, he settled for the tractable problem of counting words in Ulysses. If he were alive today, he would have been at Google’s door the moment the company announced the book digitization project.
齐普夫借助汉利的索引,按出现频率对《尤利西斯》中的单词进行了排序。排在首位的是 the,出现了 14,877 次,即每十八个单词中就有一个。第十个最常用的词是 I,出现了 2,653 次。say 出现了 265 次,排在第一百位。step 出现了 26 次,在齐普夫的排名列表中排在第一千位。要并列第一万位,就像 indisputable 这个词一样,一个词只需要出现两次。
Equipped with Hanley’s index, Zipf ranked the words in Ulysses by their frequency. The top spot is taken by the, used 14,877 times—one out of every eighteen words. The tenth most frequent word is I, with 2,653 appearances. Say, which appeared 265 times, comes in at one hundredth place. Step, which occurs 26 times, appears in the thousandth spot on Zipf’s ranked list. To be tied for the ten thousandth position, like the word indisputable, a word needed only appear twice.
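To see how trivial the word census has become, here is a minimal Python sketch of the counting-and-ranking step. It is an illustration only, not Zipf's or Hanley's actual procedure: the function name `rank_words`, the toy sentence, and the rule for what counts as a word (lowercased runs of letters and apostrophes) are all assumptions made for this example.

```python
import re
from collections import Counter

def rank_words(text):
    """Return (word, count) pairs, most frequent first."""
    # Assumption: a "word" is any run of letters or apostrophes, case ignored.
    words = re.findall(r"[a-z']+", text.lower())
    # most_common() sorts by count, highest first.
    return Counter(words).most_common()

# Hypothetical toy text standing in for a whole novel.
sample = "the cat saw the dog and the dog saw the cat run"

for rank, (word, count) in enumerate(rank_words(sample), start=1):
    print(rank, word, count)
```

Run on the full text of a novel instead of the toy sentence, the same few lines would produce a ranked list of the kind Zipf assembled by hand from Hanley's index, although the exact counts depend on how words are split.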
当齐普夫查看他的排名列表时,他注意到一件有趣的事情:单词的排名和其使用频率之间存在反比关系。如果一个词的排名数字大十倍(比如第五百位而不是第五十位),那么它的出现频率就只有十分之一。因此,排名第八、出现了 3,326 次的 his,其出现频率是排名第八十、出现 330 次的 eyes 的十倍。另一种等价的说法是,罕见词的数量比你想象的要多得多。在《尤利西斯》中,只有十个单词的使用次数超过 2,653 次。但是有一百个单词的使用次数超过 265 次,有一千个单词的使用次数超过 26 次,依此类推。
As he looked over his ranked list, Zipf noticed something funny. There was an inverse relationship between the rank of a word and its frequency of use. If a word’s numerical rank was ten times as high—five hundredth place instead of fiftieth—then it was ten times as rare. So his, ranked eighth with 3,326 mentions, is ten times more frequent than eyes, ranked eightieth, which appears 330 times. An equivalent way of thinking about this is to say that there are far more rare words than you might expect. In Ulysses, only ten words are used more than 2,653 times. But there are a hundred words used more than 265 times, and a thousand words used more than 26 times, and so on and so forth.
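One quick way to check this inverse relationship is to multiply each word's rank by its count: if frequency falls off as one over rank, the product should stay roughly constant. The short Python sketch below does this with the Ulysses figures quoted above. The names `ulysses` and `rank_times_count` are illustrative, not part of any original analysis.

```python
# (word, rank, occurrences) as quoted in the text for Ulysses.
ulysses = [
    ("his", 8, 3326),
    ("I", 10, 2653),
    ("eyes", 80, 330),
    ("say", 100, 265),
    ("step", 1000, 26),
    ("indisputable", 10000, 2),
]

def rank_times_count(rows):
    """Multiply rank by count; a pure 1/rank law would give a constant."""
    return [rank * count for _word, rank, count in rows]

products = rank_times_count(ulysses)
for (word, rank, count), product in zip(ulysses, products):
    print(f"{word:>12}  rank {rank:>5}  count {count:>4}  rank*count {product}")
# products -> [26608, 26530, 26400, 26500, 26000, 20000]
```

The product hovers near 26,000 across three orders of magnitude of rank. It sags only at the ten-thousandth position, where counts are tiny and many words are tied.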
而且,齐普夫很快发现,这不仅仅是乔伊斯《尤利西斯》中词汇的特征。同样的规律性也出现在报纸、中文和拉丁文文本以及他所研究的几乎所有其他地方的词汇中。齐普夫定律的发现今天被证明是所有已知语言的普遍组织原则。
And, as Zipf soon discovered, this wasn’t just a feature of words in Joyce’s Ulysses. The same regularity appeared in words taken from newspapers, texts written in Chinese and Latin, and pretty much everywhere else he looked. Called Zipf’s law today, the discovery turned out to be a universal organizing principle of all known languages.
在 Zipf 之前,科学家认为大多数可以测量的事物都像人类的身高一样。
Before Zipf, scientists thought that most things you could measure behaved like human height.
人类身高差异不大。90%的美国成年人的身高在 5 英尺到 6 英尺 1 英寸之间。当然,一些非常高的篮球运动员有 7.5 英尺,而世界上最矮的成年人身高略低于 2 英尺。但这两种情况都非常非常罕见。即使考虑到这些极端情况,最高的人的身高也只不过是最矮的人的四到五倍。数学家对这种数值紧密聚集在平均值附近的分布有一个专门的词。他们将这种常见的分布称为“正态分布”。在 Zipf 之前,人们认为我们生活在一个正常的世界里,一切都很正常。
Human height doesn’t vary terribly much. Ninety percent of the adults in the United States are between five feet and six-foot-one. Sure, some extremely tall basketball players are seven and a half feet, and the world’s smallest adult is just under two feet tall. But both cases are very, very rare. And even when you consider these extremes, the tallest people are only four to five times as tall as the shortest. Mathematicians have a special word for this kind of distribution, where the values are so tightly clustered around an average. They call this commonly observed distribution “normal.” Before Zipf, people thought we lived in a normal world, where things were all normal.
但正如我们所见,文字的世界远非正常,其大小分布遵循着一种非常特殊且看似奇怪的数学模式。如今,科学家将这些行为称为幂律。令人惊讶的是,一旦齐普夫在语言中发现了他的第一个幂律,他就开始在任何地方发现它们。
But as we’ve seen, the world of words is far from normal, with a distribution of sizes that obeys a very specific, and seemingly strange, mathematical pattern. Today, scientists call these behaviors power laws. Surprisingly, once Zipf found his first power law in language, he started to find them everywhere.
例如,齐普夫发现财富和收入都表现出幂律。如果你的身高与你的银行账户成正比,并且美国家庭的平均身高是五英尺七英寸,那么比尔·盖茨就会比月亮还高。大英百科全书的文章长度也遵循幂律,报纸发行量也是如此。跟踪 Zipf 工作的科学家发现了数以千计的其他例子:城市的大小、特定姓氏的频率、战争的血腥程度、人们在表演后鼓掌的时间长短、Facebook 和 Twitter 上用户的受欢迎程度、动物消耗的食物量、网站的流量、我们细胞中蛋白质的丰富程度、我们身体中细胞的丰富程度、我们生态系统中物种的丰富程度以及瑞士奶酪中孔洞的大小。甚至停电时间的长短也遵循幂律——或者我们应该称之为“缺电”定律。
For instance, Zipf found that both wealth and income exhibit power laws. If your height were proportional to your bank account, and the average American household were five-foot-seven, then Bill Gates would be taller than the moon. The lengths of articles in the Encyclopædia Britannica also obey a power law, as do newspaper circulation rates. Scientists following up on Zipf’s work found thousands of other examples: the size of cities, the frequency of particular last names, the bloodiness of wars, how long people clap after a performance, the popularity of people on Facebook and Twitter, the amount of food consumed by animals, the traffic at Web sites, the abundance of proteins in our cells, the abundance of cells in our bodies, the abundance of species in our ecosystems, and the size of holes in Swiss cheese. Even the length of power outages obeys a power law—or perhaps we should call it a “lack of power” law.
尽管齐普夫的工作具有变革性,但他提出的普适定律背后的原因仍然是个谜。齐普夫本人认为,该定律的出现是因为这种分布效率最高。另一些人则指出,规模大往往更容易变得更大,科学家将这一过程称为“富者愈富”。数学上已证明,“富者愈富”的过程可以导致各种幂律。例如,认识人更容易结识新朋友,因此最初受欢迎的人会随着时间的推移变得越来越受欢迎,就像齐普夫定律一样。规模已经很大的城市可能会吸引那些考虑搬家的人,从而导致城市规模的幂律。这里还有另一种解释:可以证明,猴子在电脑上随机打字会产生“单词”(由空格分隔的字符),其丰富度表现出幂律。
Although Zipf’s work was transformative, the reasons behind his ubiquitous law remain mysterious. Zipf himself believed that it emerged because such distributions were maximally efficient. Others have pointed out that being big often makes it easier to get bigger, a process known to scientists as “rich get richer.” Mathematically, it has been shown that a “rich get richer” process can lead to all sorts of power laws. For instance, knowing people makes it easier to meet new people, so initially popular people will get more and more popular over time, in Zipfian fashion. Cities that are already large might be appealing to someone considering a move, leading to power laws of city size. Here’s yet another account: It can be shown that monkeys typing on a computer at random would produce “words” (characters separated by a space) whose abundance exhibits a power law.
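The typing-monkey account is easy to try out. The Python sketch below is a toy simulation under assumed settings, not a reproduction of any published analysis: the four-letter alphabet, the 20 percent chance of hitting the space bar, the fixed random seed, and the name `monkey_words` are all choices made for illustration.

```python
import random
from collections import Counter

def monkey_words(n_keystrokes, alphabet="abcd", p_space=0.2, seed=0):
    """Simulate random typing and return the resulting list of 'words'."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    keys = []
    for _ in range(n_keystrokes):
        if rng.random() < p_space:
            keys.append(" ")                   # the monkey hits the space bar
        else:
            keys.append(rng.choice(alphabet))  # or a random letter key
    # split() drops the empty strings left by consecutive spaces.
    return "".join(keys).split()

counts = Counter(monkey_words(200_000)).most_common()

# Print the count found at a few ranks that are powers of ten.
for rank in (1, 10, 100, 1000):
    if rank <= len(counts):
        word, count = counts[rank - 1]
        print(f"rank {rank:>4}: {word!r} typed {count} times")
```

In runs like this, short "words" are typed far more often than long ones, so counts fall steeply with rank, much as they do in real text. The decline comes in steps, one per word length, rather than as a smooth curve.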
对于任何特定幂律分布的成因,通常都有多种相互竞争的解释。唉,这种过多的解释可能反映出科学家并不真正知道到底发生了什么。
There are often multiple competing explanations for the cause of any particular power-law distribution. Alas, this overabundance of explanations probably reflects the fact that scientists don’t really know what’s going on.
然而,无论其成因如何,幂律都恰如其分地描述了一系列令人惊叹的自然和社会现象。德语教授齐普夫在汉利对小说《尤利西斯》非凡热情的推动下,引发了一场革命,其后果彻底改变了定量社会科学的大部分领域,其触角已延伸至生物学、物理学,甚至数学。齐普夫就是新常态。
Still, whatever their cause, power laws aptly describe a stunning range of natural and social phenomena. Zipf, a professor of German, aided by Hanley’s uncommon enthusiasm for the novel Ulysses, set off a revolution whose consequences transformed much of quantitative social science and whose tentacles have reached biology, physics, and even mathematics. Zipf is the new normal.
齐普夫定律正是我们寻找语言进化遗迹的试金石。语言中几乎所有事物都遵循齐普夫定律:名词、动词、形容词、以“m”开头的副词、 表示职业的词语、与“rhyme”押韵的词语等等。所以,当你遇到不符合齐普夫普遍原理的事物时,就一定有事情非常可疑。就像一块白色的岩石在一个特别有希望的探险地点显得格格不入一样,语言中不遵循幂律的现象,很可能就是我们语言进化的化石。
Zipf’s law was just the touchstone we needed to go hunting for remnants of language evolution. Virtually everything in language obeys Zipf’s law: nouns, verbs, adjectives, adverbs that start with m, words for professions, words that rhyme with rhyme, and so on and so forth. So, when you run into something that does not behave according to Zipf’s universal principle, something really fishy is going on. Like a whitish rock that seems out of place at a particularly promising expedition site, a phenomenon in language that doesn’t obey a power law just might turn out to be a fossil of our language’s evolution.
这就是那个让我们如此着迷的童真问题:“为什么我们说 drove 而不是 drived?”
That’s where the childlike question that had so captivated us comes in: “Why do we say drove and not drived?”
Drove属于一类英语单词,称为不规则动词。不规则动词很奇怪。如果不规则动词遵循齐普夫定律,就像几乎所有其他类型的词一样,你就会认为它们中的大多数都是罕见的。然而,几乎所有不规则动词非常常见。虽然只有大约3%的动词是不规则的,但最常用的十个动词都是不规则的。简而言之,不规则动词是齐普夫定律的一个明显而引人注目的例外。它们正是我们一直在寻找的东西,就好像霸王龙骨骼的位置被一块统计墓碑方便地标记了一样。
Drove is a member of a class of English words called the irregular verbs. Irregular verbs are strange. If irregular verbs obeyed Zipf’s law, the way almost all other classes of words do, you would expect most of them to be rare. Instead, nearly all irregular verbs are very frequent. Although only about 3 percent of verbs are irregular, the ten most frequent verbs are all irregular. To put it simply, irregular verbs are a clear and dramatic exception to Zipf’s law. They were exactly what we had been looking for, as though the position of the T. rex skeleton had been conveniently marked by a statistical headstone.
这些所谓的不规则动词是什么?它们对齐普夫有什么影响?这对语言的进化意味着什么?
Who were these so-called irregular verbs, what had they done to Zipf, and what did that mean about the evolution of language?
英语动词变位乍一看轻而易举。要构成英语动词的过去式,只需添加-ed即可:jump变成jumped。成千上万的动词都遵循这个简单的规则。当新的动词进入语言时,它们默认遵循这个规则。我可能以前从未听说过flamboozing ,但我知道,如果你昨天选择了flambooze,那么昨天你就已经 flamboozed 了。
English verb conjugation is, at first glance, a walk in the park. To form the past tense of an English verb, all you have to do is to add -ed: jump becomes jumped. Hundreds of thousands of verbs obey this simple rule. When new verbs enter the language, they obey this rule by default. I may have never heard of flamboozing before, but I know that if you chose to flambooze yesterday, then yesterday you flamboozed.
除了让英语学习者懊恼的令人讨厌的不规则动词。动词喜欢知道。甚至在你读到这句话之前,你可能就知道我们不说knowed。不规则动词大约有三百个,有时被语言学家称为强动词,包括英语中最常用的十个动词:be/was、have/had、do/did、say/said、go/went、get/got、make/made、know/knew、see/saw、think/thought。 它们是如此频繁,以至于当你使用一个动词时,有 50% 的可能性它是不规则的。
Except—much to the chagrin of English learners—for the pesky irregular verbs. Verbs like to know. Even before you read this sentence, you probably knew that we don’t say knowed. About three hundred in all, the irregular verbs—sometimes called strong verbs by linguists—include the ten most frequent verbs in the English language: be/was, have/had, do/did, say/said, go/went, get/got, make/made, know/knew, see/saw, think/thought. They are so frequent that, when you use a verb, there is a 50 percent chance that it will be irregular.
这些不规则生物从何而来?说来话长。大约在六千到一万两千年前,一个现代学者称之为原始印欧语的语言当时被广泛使用。包括英语、法语、西班牙语、意大利语、德语、希腊语、捷克语、波斯语、梵语、乌尔都语、印地语以及数百种其他语言在内的大量现代语言都源自原始印欧语。原始印欧语拥有一套被学者们称为元音变换(ablaut)的系统,该系统通过按照固定规则改变元音,将一个词转换为一个相关的词。在英语中,元音变换仍然以微妙的模式存在于不规则动词中。
Where did the irregulars come from? It’s a long story. Sometime between six thousand and twelve thousand years ago, a language known to modern scholars as Proto-Indo-European was spoken. An astonishing array of modern languages, including English, French, Spanish, Italian, German, Greek, Czech, Persian, Sanskrit, Urdu, Hindi, and hundreds of others, descend from Proto-Indo-European. Proto-Indo-European had a system, known to scholars as the ablaut, that transformed a word into a related one by changing its vowels according to fixed rules. In English, the ablaut can still be seen in the form of subtle patterns among the irregular verbs.
这里有一个模式的例子:今天我唱歌,昨天我唱歌,这首歌被唱了。类似地:今天我响铃,昨天我响了,电话响了。这里还有另一个模式:今天我坚持,昨天我坚持。今天我挖,昨天我挖了。当动词变位规则消亡时,它们会留下化石。我们称这些化石为不规则动词。
Here is an example of one pattern: Today I sing, yesterday I sang, the song was sung. Similarly: Today I ring, yesterday I rang, the phone has rung. Here’s another pattern: Today I stick, yesterday I stuck. Today I dig, yesterday I dug. When rules of conjugation die, they leave behind fossils. We call those fossils irregular verbs.
是什么样的语法小行星摧毁了这些古老的规则,只留下了不规则的枯骨?
What sort of grammatical asteroid wiped out these ancient rules, leaving behind only the dry bones of the irregulars?
那颗小行星就是所谓的齿后缀,在现代英语中写作-ed。用-ed表示过去时的做法起源于原始日耳曼语,这是一种在公元前500年至250年间在斯堪的纳维亚半岛使用的语言。
That asteroid was the so-called dental suffix, written as -ed in Modern English. The use of -ed to signify the past tense emerged in Proto-Germanic, a language spoken between 500 and 250 BCE in Scandinavia.
原始日耳曼语是所有现代日耳曼语系语言的祖先,包括英语、德语、荷兰语以及许多其他语言。由于原始日耳曼语是原始印欧语系的后裔,它继承了古老的动词变位元音字母变换方案。这种方案在大多数情况下都行之有效。但偶尔也会有新的动词进入该语言,其中一些并不完全符合任何古老的元音字母变换模式。因此,原始日耳曼语的使用者发明了一些新的东西,通过添加 -ed 来构成这些新兴的、不墨守成规的动词的过去式。在原始日耳曼语中,规则动词是个例外。
Proto-Germanic was the linguistic ancestor of all the modern Germanic languages, including English, German, Dutch, and many others. Because it was a descendant of Proto-Indo-European, Proto-Germanic inherited the old ablaut scheme for conjugating verbs. And this worked fine most of the time. But occasionally, new verbs entered the language, and some of these didn’t quite fit any of the old ablaut patterns. So the speakers of Proto-Germanic invented something new, forming the past tense of these young, nonconformist verbs by adding that -ed. In Proto-Germanic, the regular verbs were the exception.
但好景不长。用齿后缀来标记过去时是一项极其成功的发明,它开始迅速传播。如同任何颠覆性技术一样,这项新规则始于边缘地带,为那些元音变换无法使用的、看起来很奇特的动词服务。但一旦它站稳了脚跟,就没有停下来的迹象。简单易记的齿后缀开始吸引更多的追随者,因为那些一直使用古老的元音变换模式的动词开始转向使用元音变换。
But not for long. Use of the dental suffix to mark the past tense was a tremendously successful invention, and it began to spread rapidly. Like any disruptive technology, the new rule started at the margins, serving funky-looking verbs that the ablaut could not. But once it had established this beachhead, it did not stop. Simple and memorable, the dental suffix began to attract additional adherents, as verbs that had always used the venerable ablaut patterns started making the switch.
因此,到大约 1,200 年前,也就是经典古英语文本《贝奥武甫》写成之时,超过四分之三的英语动词都遵循着这条新规则。旧的元音变换实力受损,节节败退,而新兴的 -ed 规则则处处紧追不舍。在接下来的一千年里,越来越多的不规则形式倒戈。一千年前,我会说“I would have holp you”;而就在昨天,我会说“I would have helped you”。
Thus, by the time that the classic Old English text Beowulf was written, about 1,200 years ago, more than three-quarters of English verbs obeyed the new rule. With its strength eroded, the old ablaut was now on the run, the upstart -ed rule everywhere nipping at its heels. More and more irregular forms defected over the next thousand years. A millennium ago, I would have holp you. Just yesterday, though, I would have helped you.
今天的语言学家借助后见之明,把这个过程称为规则化。而且它仍在继续。以动词 thrive 为例。大约九十年前,《纽约时报》的一条标题是“Gambling Halls Throve in Billy Busteed’s Day”。但在 2009 年,《纽约时报》在其科学版块刊登了一篇文章,题为“Some Mollusks Thrived After a Mass Extinction”。与那些幸运的软体动物不同,throve 是元音变换大灭绝的受害者。没有回头路了:动词一旦变得规则,就几乎不会再变得不规则。每有一个 snuck 溜进来,就有多得多的 flew 变成 flied 飞走。
This is a process that today’s linguists, with the benefit of hindsight, call regularization. And it’s still going on. Consider the verb thrive. About ninety years ago, a headline in the New York Times read “Gambling Halls Throve in Billy Busteed’s Day.” But in 2009, the Times ran an article in its Science section titled “Some Mollusks Thrived After a Mass Extinction.” Unlike those lucky mollusks, throve was a victim of the mass extinction of the ablaut. There is no going back: Once they are regular, verbs almost never irregularize. For every sneak that snuck in, there are many more flews that flied out.
如同温泉关战役中的三百名斯巴达勇士,英语的不规则动词(共三百个,且都是“强”动词)一直在坚决抵御着一场始于公元前 500 年的无情进攻。这场战斗,它们每天都在进行,在每一座城市、每一座城镇、每一条说英语的街道上。这场仗它们已经打了 2,500 年。它们不仅仅是例外,它们是幸存者。
Like the three hundred Spartans at Thermopylae, the English irregular verbs—three hundred, strong—have been resolutely holding off a merciless assault on their kind that began in 500 BCE. It is a battle they have waged every day, in every city, in every town, along every street where English is spoken. They have been waging it for 2,500 years. They are not merely exceptions: They are survivors.
而它们存活下来的过程正是我们想要研究的过程:语言的进化。
And the process that they survived was exactly the process that we intended to study: the evolution of language.
为什么某些不规则形态会灭绝,而另一些却能幸存下来?为什么“throve”没有继续繁衍生息,而“drive”却没有消失?
Why did certain irregular forms die out, while others managed to survive? Why didn’t throve thrive on, and why didn’t drove drive off?
语言学家们对于不规则动词为何使用频率如此之高,已经有了一些很好的想法。他们推断,一个不规则动词我们遇到得越少,它就越难学,也越容易被遗忘。正因如此,像 throve 这样罕见的不规则形式比像 drove 这样常见的不规则形式消失得更快。随着时间的推移,低频的不规则动词逐渐退出,而不规则动词作为一个整体的使用频率则变得更高。
Linguists already had some great ideas about why irregular verbs have such high frequencies. They reasoned that the less often we encounter an irregular verb, the harder it is to learn and the easier it is to forget. Because of this, rare irregular verbs, like throve, disappear more rapidly than the frequent ones, like drove. Over time, low-frequency irregulars drop out, and the irregulars as a whole become more frequent.
对我们来说,这个假设极其令人兴奋,因为它表明不规则动词正在经历一个与自然选择进化相同的过程。为什么根据齐普夫定律,其他所有词汇类别都由稀有词主导,而这些不规则动词却如此频繁地出现呢?因为自然选择,以永不满足的-ed规则的形式,赋予了常见的不规则动词进化优势。动词出现的频率越高,就越适合生存。
To us, this hypothesis was extremely exciting, as it suggested that irregular verbs are undergoing a process identical to evolution by natural selection. Why are the irregulars so frequent, when, in accordance with Zipf’s law, every other lexical class is dominated by rare words? Because natural selection, in the form of the insatiable -ed rule, gives common irregulars an evolutionary advantage. The more frequent a verb is, the more fit it is to survive.
这是迄今为止我们所遇到的关于自然选择作用于人类文化的最清晰的描述。齐普夫的指南针指引我们去解决一个引人入胜的问题:语言学家的推测经得起仔细推敲吗?如果答案是肯定的,那将是人类文化可以通过自然选择进化,这是一个简单的例子。就像齐普夫一样,我们现在要做的就是找到数据。
This was by far the tidiest account of natural selection operating on human culture that we had ever encountered. Zipf’s compass had guided us to a fascinating problem: Would the linguists’ hunch be borne out under careful scrutiny? If so, it would be a simple illustration that human culture can evolve by natural selection. Like Zipf, all we had to do now was find the data.
为了帮助我们完成这项任务,我们招募了哈佛大学两位非常聪明的本科生,乔·杰克逊和蒂娜·唐。理想情况下,我们希望乔和蒂娜能够阅读所有出版的英语书籍,并记录下他们遇到的每一个不规则动词。但他们告诉我们,他们俩都计划在四年内毕业。(作为博士生,我们很少想到毕业。)我们需要随机应变。
To aid us in our quest, we enlisted two extremely bright undergraduates at Harvard College, Joe Jackson and Tina Tang. In an ideal world, we hoped that Joe and Tina could read everything ever published in the English language and record every instance of an irregular verb that they encountered. But they told us that they were both planning on graduating in four years. (As doctoral students, we rarely gave graduating a thought.) We would need to improvise.
幸运的是,乔和蒂娜从齐普夫的故事中学到了很多。他们想到了一个替代方法:与其把所有东西都读一遍,何不只读所有关于历史英语语法的教科书呢?比如说,中古英语的语法教材肯定会讨论不规则动词,会提到其中的许多,而且很可能会在某处提供部分列表。通过走遍图书馆,阅读每一本讨论各个历史时期英语语法的教科书,我们大概就能很好地了解哪些动词在什么时候是不规则的。这些语法教科书对我们的作用,正如汉利关于《尤利西斯》的著作对齐普夫的作用一样。
Fortunately, Joe and Tina had learned a great deal from the story of Zipf. They hit on an alternative approach. Instead of reading absolutely everything, why not just read all the textbooks on historical English grammar? Grammar texts of, say, Middle English, would surely discuss irregular verbs, would mention many of them, and would probably provide a partial listing somewhere. By going through the library and reading every single textbook dealing with the grammar of historical Englishes, we could probably get a pretty good picture of what was irregular and when. These grammar textbooks could do for us exactly what Hanley’s Ulysses treatise did for Zipf.
当然,说起来容易做起来难。乔和蒂娜花了数月时间一丝不苟地研究古英语(《贝奥武甫》的语言,大约公元800年使用)和中世纪英语(乔叟的语言,大约十二世纪使用)的教科书。他们挖掘出了177个古英语不规则动词,每个动词都能追溯到一千多年前。凭借这一千年的快照,我们终于能够看到这门语言是如何演变的。
Of course, this is easier said than done. Joe and Tina did many months of meticulous work, reading textbooks of Old English (the language of Beowulf, spoken circa 800 CE) and of Middle English (Chaucer’s language, spoken around the twelfth century). They dug up 177 Old English irregular verbs, each of which they could track for more than a thousand years. With a millennium’s worth of snapshots, we could finally see how the language was changing.
所有177个动词最初都是古英语中的不规则动词。到那时四个世纪后的中世纪英语中,只有145种不规则形式幸存下来;其余32种已经规则化。到了现代英语,只有98种仍然是不规则形式。其余79个动词仍然存在于该语言中,但像melt一样,它们的形式已经发生了变化。
All 177 verbs started as irregulars in Old English. By the time of Middle English, four centuries later, only 145 of the irregular forms survived; the remaining 32 had regularized. By Modern English, only 98 of them remained irregular. The other 79 verbs are still in the language, but, like melt, they have changed form.
然而,其中存在着惊人的不平衡。在我们列表中12个最常用的动词中,没有一个动词被规则化——它们都顶住了-ed规则长达12个世纪的压力。而另一方面,动词的“受害者”却随处可见。在我们列表中12个最不常用的动词中,有11个动词被规则化,包括像bide和wreak这样的动词。唯一幸存下来的低频不规则动词是slink,这个动词恰如其分地描述了这种悄无声息的消失过程。
Yet there was a striking imbalance. Among the 12 most frequent verbs in our list, none had become regular—they had all resisted twelve centuries of pressure by the -ed rule. At the other end of the spectrum, the casualties were everywhere. Of the 12 least frequent verbs on our list, 11 had become regular, including verbs like bide and wreak. The only low-frequency irregular to survive is slink, a verb that aptly describes this quiet process of disappearance.
数据已经说明了一切:某种类似自然选择的东西正在影响人类文化,并在动词中留下痕迹。使用频率对动词的存续有着极其显著的影响,决定了哪些动词是哀悼/哀悼的,哪些是适合/适合生存的。
The data had spoken: Something akin to natural selection was influencing human culture, leaving its fingerprints among the verbs. Usage frequency was having an extraordinarily strong effect on verb survival, making the difference between the verbs that were mourn/mourned and the verbs that were fit/fit to survive.
在生物学中,证明某种特性的自然选择正在发生,比衡量该特性与进化适应度之间的确切关系要容易得多。(判断风力强弱很容易,但判断风力有多强却要困难得多。)如果没有对适应度的评估,我们所能知道的只是进化会倾向于什么样的变化;我们不知道这些变化需要多长时间才能发生。
In biology, it is much easier to show that natural selection for a trait is happening than to measure the exact relationship between that trait and evolutionary fitness. (It’s easy to tell that it’s windy, but much harder to tell how strongly the wind is blowing.) Without estimates of fitness, all we know is what sort of changes evolution will favor; we have no idea how long it will take for those changes to come about.
然而,不规则动词的情况与生物进化的典型情况不同。在生物学中,要计算单个生物体的适应度,必须考虑成千上万甚至数百万个特征。对于不规则动词来说,显然只有一个特征,即使用频率,是决定其适应度的最重要因素。这极大地简化了问题。这意味着我们或许能够可靠地估计动词不规则形式的消失速度。
The case of the irregular verbs, though, is not like the typical case of biological evolution. In biology, thousands or even millions of traits must be taken into account to compute the fitness of a single organism. For the irregulars, it was clear that a single trait—usage frequency—was by far the most significant factor in determining fitness. This simplified matters immensely. It meant that we might be able to reliably estimate how quickly the irregular forms of verbs would disappear.
但在深入探讨这个问题之前,让我们先回顾一下科学史上最著名的消失现象:放射性理论。
But before we dive into that, let us remind you of the most famous disappearing act in all of science: the theory of radioactivity.
放射性物质用途广泛,从动力反应堆到医学成像系统,再到炸弹。这些物质不断处于消失的过程中,因为随着时间的推移,放射性物质的原子会转变为稳定的非放射性原子。这种衰变会释放能量,通常以辐射的形式释放。放射性物质因此得名。
Radioactive materials are used in everything from power reactors to medical imaging systems to bombs. These materials are constantly in the process of disappearing, because, as time passes, atoms of a radioactive substance morph into stable, nonradioactive atoms. This decay releases energy, often in the form of radiation. That’s how radioactive substances got their name.
放射性物质最重要的特性是其半衰期。这指的是该物质样本中一半原子衰变所需的平均时间。假设一种物质的半衰期为一年。如果你一开始将十亿个该物质的原子放入一个罐子中,一年后,只剩下五亿个原子,另外五亿个原子会衰变成其他物质。两年后,只剩下二点五亿个原子(一半的一半)。三年后,只剩下八分之一。以此类推。
The most important property of a radioactive substance is its half-life. This is the period of time it takes, on average, for half of the atoms in a sample of the substance to decay. Suppose you have a substance whose half-life is one year. If you start with a billion atoms of that substance in a jar, then a year later, only half a billion atoms of the substance will be left—the other half billion will have decayed into something else. After two years, only one-quarter of a billion atoms (half of a half) will be left. After three years, an eighth. And so on.
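The jar arithmetic above follows the standard half-life rule: the amount remaining is the starting amount multiplied by one-half raised to the number of half-lives elapsed. A minimal Python sketch, with the function name `remaining` chosen purely for illustration:

```python
def remaining(initial, half_life, elapsed):
    """Expected amount left after `elapsed` time units of decay.

    `half_life` and `elapsed` must use the same time unit.
    """
    return initial * 0.5 ** (elapsed / half_life)

# The jar from the text: a billion atoms with a half-life of one year.
for years in (1, 2, 3):
    left = remaining(1_000_000_000, half_life=1, elapsed=years)
    print(f"after {years} year(s): {left:,.0f} atoms")
```

These are averages. Each atom decays at random, so a real jar would hold slightly more or fewer atoms than the formula says.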
在研究不规则动词向规则动词的转化时,我们发现,一旦考虑到频率,规则化过程在数学上与放射性原子的衰变难以区分。此外,如果我们知道不规则动词的频率,就可以用公式计算它的半衰期。这很了不起,因为对于放射性原子来说,必须通过实验测量半衰期;这通常无法计算。从这个意义上讲,放射性数学更适用于不规则动词,而不是放射性原子。
As we examined the transformation of irregular verbs into regular verbs, we found that, once one took frequency into account, the process of regularization was mathematically indistinguishable from the decay of a radioactive atom. Moreover, if we knew the frequency of an irregular verb, we could use a formula to compute its half-life. This was remarkable, because for radioactive atoms, you have to measure the half-life experimentally; it’s usually impossible to compute. In this respect, the mathematics of radioactivity applied even more neatly to irregular verbs than to radioactive atoms.
这个公式简洁而优美:动词的半衰期与其频率的平方根成正比。一个频率低一百倍的不规则动词,其规则化速度会快十倍。
The formula was simple and beautiful: The half-life of a verb scales as the square root of its frequency. An irregular verb that is one hundred times less frequent will regularize ten times as fast.
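The square-root rule can be turned into a small calculation. The Python sketch below covers only the relative scaling stated above. It does not include the constant needed to turn a frequency into an absolute half-life, which the text does not give, and the function names are hypothetical.

```python
import math

def half_life_factor(frequency_ratio):
    """Factor by which a verb's half-life changes when its usage
    frequency is multiplied by `frequency_ratio`, assuming the
    half-life is proportional to the square root of frequency."""
    return math.sqrt(frequency_ratio)

def regularization_speedup(frequency_ratio):
    """How many times faster the verb regularizes: the inverse
    of the change in half-life."""
    return 1.0 / half_life_factor(frequency_ratio)

# A verb one hundred times less frequent has a tenth of the half-life,
# so it regularizes about ten times as fast.
print(half_life_factor(1 / 100))
print(regularization_speedup(1 / 100))
```

Read in the other direction, a verb one hundred times more frequent would be expected to hold out about ten times as long.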
例如,频率在百分之一到千分之一之间的动词——比如“喝”或“说话”——的半衰期约为5400年。这与碳14的半衰期(5715年)相当,碳14是最著名的古代文物年代测定同位素。
For instance, verbs whose frequencies fall between one in one hundred and one in one thousand—verbs like drink or speak—have a half-life of roughly 5,400 years. This is comparable to the half-life of carbon-14 (5,715 years), the isotope that is most famously used in dating ancient relics.
一旦计算出不规则动词的半衰期,就可以预测它们的未来。基于以上分析,我们预测,当begin、break、bring 、 buy、choose、draw、drink、drive、eat、fall这几个动词集合中的一个动词被规则化时,bid、dive、heave、shear、shed、slay、slit、sow、sting、stink这几个动词集合中的五个动词就已经被规则化了。并且,如果目前的趋势保持不变,到2500年,我们177个不规则动词中将只有83个仍然是不规则动词。
Once you’ve calculated the half-life of irregular verbs, it’s possible to make predictions about their future. Based on the above analysis, we predicted that by the time one verb from the set begin, break, bring, buy, choose, draw, drink, drive, eat, fall regularizes, five verbs from the set bid, dive, heave, shear, shed, slay, slit, sow, sting, stink will have already regularized. And that if current trends hold up, only 83 of our 177 irregular verbs will still be irregular in the year 2500.
我们对此感到非常兴奋,我们将我们的预测总结成一个简短的故事:
We were so excited about this that we summed our predictions up as a short story:
他是一位来自26世纪的有教养的人,所以当别人说他的语法烂透了的时候,他真的挺难受的。“烂透了,”时间旅行者纠正道。
He was a well-breeded man from the twenty-sixth century, so it really stinged when they said his grammar stunk. “Stinked,” the time-traveler corrected.
If you’re planning on doing some time travel anytime soon, you’d do well to memorize this instructive tale.
我们还可以预测特定动词的命运。在共同生活了数千年之后,当今哪个不规则动词最有可能抛弃当前的伴侣,去追求更年轻的伴侣?矛盾的是,答案是wed/wed,它是现代不规则动词中最不常见的。现在,wed/wedded已经频繁出现在公共场合。现在是你成为新婚夫妇的最后机会。未来的已婚夫妇只能期盼婚姻的幸福。
We could also anticipate the fate of particular verbs. After thousands of years together, which of today’s irregular verbs is most likely to abandon its current conjugal partner in pursuit of a younger model? Paradoxically, the answer is wed/wed, least frequent of the modern irregular verbs. Already, wed/wedded are frequently spotted in public. Now is your last chance to be a newly-wed. The married couples of the future can only hope for wedded bliss.
最后,我们可以回答那个让我们踏上旅程的童真问题了。
And finally, we could answer the childlike question that had started us off on our journey.
“为什么我们说 drove 而不是 drived?”
“Why do we say drove and not drived?”
我们之所以仍然说 drove,尽管我们已经成群结队(in droves)地放弃了 throve 等其他不规则形式,是因为 drive 比 thrive 常见得多。在任何一个世纪,像 throve 这样的动词被规则化的可能性大约是像 drove 这样的动词的五倍。当然,如果英语存续的时间足够长,drove 最终也会消失。我们估计,在 drove 驶向夕阳之前,我们还有大约 7,800 年的时间。孩子们在未来很长一段时间里都会对此感到好奇。
The reason we still say drove (whereas we’ve abandoned other irregular forms, like throve, in droves) is that drive is far more frequent than thrive. In any given century, verbs like throve are about five times as likely to regularize as verbs like drove. Of course, drove, too, will eventually disappear, if English survives long enough. Our estimates suggest that we still have about 7,800 years before drove drives off into the sunset. Kids will keep on wondering about it for a long time to come.
哈佛园的中心,矗立着一座纪念约翰·哈佛的大型铜像。这座铜像整体色调暗淡,唯独左脚的那只鞋始终闪闪发亮。不知何故,用手触摸这只鞋拍照,成了每位到访哈佛的游客的必做之事。
In the center of Harvard Yard, there is a big statue commemorating the life of John Harvard. The bronze figure has a dull coloration, except for the left shoe, which always looks shiny. For some reason, taking a picture with your hand touching this shoe has become an item on the to-do list of every tourist who visits Harvard.
约翰·哈佛的鞋子为何如此闪亮?大多数人认为,这座雕塑最初创作时,包括鞋子在内的整个表面都是暗淡的青铜色,后来经过成千上万的参观者之手的打磨,才露出了闪亮的表面。
Why is John Harvard’s shoe so shiny? Most people think that when the sculpture was originally created, the entire façade—including the footgear—was a dull bronze, and that gradual polishing by thousands of visiting hands first exposed the shoe’s gleaming surface.
但青铜本身就是一种天生闪亮的金属。一个多世纪前,这座雕塑最初铸造时,也和其他青铜雕塑一样,闪闪发光。雕塑最外层(被称为铜锈)的光泽缺失,是由于自然风化、修复工作,甚至艺术家本人造成的腐蚀。如今,只有那只鞋,因为成千上万的路人频繁的擦刷,才保留了金属的本色。
But bronze is a naturally shiny metal. When it was originally cast, more than a century ago, this sculpture—like any other bronze sculpture—was shiny, too. The absence of luster, a topmost layer known as the sculpture’s patina, is a result of corrosion brought about by natural weathering, by restoration efforts, and even by the artist himself. The metal’s true color survives only in that shoe, thanks to the frequent brush of thousands of passersby.
不规则动词就是这样。初次遇到它们时,你会疑惑:这些奇怪的例外是怎么来的?但事实上,如今的不规则动词遵循的模式与许多世纪前相同。随着周围语言的演变,频繁的接触保护了不规则动词免受侵蚀。它们是我们刚刚开始理解的进化过程的化石。如今,我们把所有其他动词都称为规则动词。但规则性并非语言的默认状态。规则是千百种例外的墓碑。
The irregular verbs are just like this. When you first encounter them, you wonder, How did these strange exceptions get here? But in fact, the irregular verbs obey the same patterns today that they obeyed many centuries ago. As the language around them changed, frequent contact protected the irregulars from corrosion. They are fossils of an evolutionary process that we are just beginning to understand. Today, we call all those other verbs regular. But regularity is not the default state of a language. A rule is the tombstone of a thousand exceptions.
詹姆斯·乔伊斯《尤利西斯》词汇索引的出版堪称一大胜利,体现了多年的坚持和对细节的关注。尽管索引编制有着极其悠久而辉煌的历史,但在1937年出版时,这类索引只适用于最重要的书籍。例如,最古老的希伯来圣经索引,即马索拉,写于一千多年前。
The Word Index to James Joyce’s Ulysses was a triumph, reflecting years of perseverance and attention to detail. At the time it was published, in 1937, such indices were available for only the most important books, despite the fact that concordance writing has an extremely long and illustrious history. For instance, the oldest concordances of the Hebrew Bible, known as the Masorah, were penned more than a thousand years ago.
1946年,情况开始发生变化。那一年,一位名叫罗伯托·布萨(Roberto Busa)的耶稣会神父萌生了一个绝妙的想法。布萨是一位研究多产神学家托马斯·阿奎那的学者,他想要一本阿奎那著作的索引,以帮助他的研究。当时,计算机技术正开始迅猛崛起,布萨认为或许可以用一种新的方式编制索引,那就是将一本书的原始文本输入到这些新机器中。他直接向IBM提出了自己的想法。IBM听取了他的意见,并决定支持他的努力。经过30年的努力和IBM的大力帮助,布萨的计划最终奏效了:巨著《托马斯索引》(Index Thomisticus)于1980年完成。这部著作给学术界留下了深刻的印象,如同汉利的《索引》一样,布萨的《索引》最终催生了一个全新的领域。如今,这一领域被称为数字人文学科,致力于探索计算机与历史、文学等传统人文学科产生关联的各种方式。
Things began to change in 1946. That year, a Jesuit priest named Father Roberto Busa had a powerful idea. Busa, a scholar of the prolific theologian Thomas Aquinas, wanted a concordance of Aquinas’ work to help him with his studies. Computer technology was beginning its meteoric ascent, and Busa thought it might be possible to create a concordance in a new way, by feeding the raw text of a book into one of these new machines. He took his idea straight to IBM. The company heard him out and decided to support his efforts. It took thirty years and lots of IBM’s help, but Busa’s plan eventually worked: The monumental Index Thomisticus was completed in 1980. The world of scholarship was impressed, and like Hanley’s Index, Busa’s Index ultimately gave rise to a new field. Known today as the digital humanities, work in this area concerns itself with all the ways in which computers can be relevant to traditional humanistic enterprises like history and literature.
尽管这些索引影响非凡,人们很容易将其视为昙花一现。不久之后,现代计算机的蓬勃发展意味着创建索引只需一行代码,易于编写且可立即运行。当赖默发表了她称之为“传奇、词汇、饶舌的爱情”的字母实验时——本质上是一个索引,但省略了页码——索引本身只值得简短的致谢。如今,学者们很少费心去编写新的索引。没有必要,因为一台廉价的笔记本电脑几乎可以立即在长文本中搜索某个单词的所有实例。表面上看,索引的时代已经终结了。
Despite the extraordinary influence of these indices, it is easy to think of them as a swan song. It was not long before the burgeoning power of modern computers meant that creating a concordance took only a single line of code, easy to write and instantaneous to run. By the time Reimer published the alphabetical experiment she called Legendary, Lexical, Loquacious Love—essentially a concordance, but with page references left out—the concordancing itself merited only a brief acknowledgment. Today, scholars rarely bother to make new concordances. There’s no need, since a cheap laptop computer can search a long text for all instances of a word almost instantaneously. On the surface, the age of concordances has come to an end.
然而,如果你揭开现代科技的面纱,你发现的东西可能会让你大吃一惊。当今世界正由互联网搜索引擎——迄今为止最强大的信息查找工具——驱动着。什么是搜索引擎?搜索引擎的核心是一个包含单词和这些单词出现的网页的列表。每个小小的白色搜索框背后都隐藏着一个巨大的数字索引。
Yet if you pop the hood of modern technology, what you find underneath may surprise you. Today’s world is kept humming by Internet search engines, the most powerful information-finding tools ever developed. What is a search engine? At its core, a search engine is a list of words and the pages on the Web on which those words appear. Hiding behind every little white search box is a massive digital concordance.
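The idea that a search engine is, at its core, a list of words and the pages on which they appear can be sketched in a few lines of Python. This is only a toy illustration: the page names and page texts are invented, and real search engines add ranking, stemming, and much more on top of this structure.

```python
import re
from collections import defaultdict

def build_concordance(pages):
    """Map each lowercase word to the sorted list of page ids that contain it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical three-page "Web" (invented text, for illustration only).
pages = {
    "a.html": "The rose is a rose",
    "b.html": "Count the petals of the rose",
    "c.html": "Petals fall",
}

concordance = build_concordance(pages)
print(concordance["rose"])    # ['a.html', 'b.html']
print(concordance["petals"])  # ['b.html', 'c.html']
```

Looking up a word is then a single dictionary access, which is why the lookup feels instantaneous even though building the index requires reading every page once.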
布萨之后,索引并没有消失,相反,它们风靡全球。
Concordances didn’t die out after Busa. Instead, they took over the world.
齐普夫是一位杰出的人物,他的工作改变了众多领域,其中大多数与他的专业领域相去甚远。从语言到生物学,从城市规划到奶酪物理学,如今的科学家很难不接触到齐普夫留下的宝贵遗产。在我们自己的研究中,齐普夫为我们揭开语言进化的秘密提供了所需的线索。
Zipf was a remarkable man whose work transformed numerous fields, most of them distant from his own expertise. From language to biology, urban planning to the physics of cheese, it’s hard to be a scientist today without encountering Zipf’s legacy. In our own work, Zipf provided the clue we needed to begin uncovering the secrets of language evolution.
这位古怪的德国文学学者究竟有何过人之处,使得他在科学上如此具有预见性?
What was it about this oddball scholar of German literature that made him, scientifically speaking, so prophetic?
认知心理学的创始人之一乔治·A·米勒曾对齐普夫做出过如下评价,我们认为这在很大程度上解答了这个问题。米勒说齐普夫是那种“会把玫瑰掰开数花瓣”的人。表面上看,这番话并不怎么好听。齐普夫真的如此痴迷于数数,以至于无法欣赏花朵的美吗?
George A. Miller, one of the founders of cognitive psychology, once gave the following take on Zipf, and we think it goes a long way toward answering that question. Miller said that Zipf was the kind of man who would “take roses apart to count their petals.” On the surface, this doesn’t sound terribly flattering. Was Zipf so obsessed with counting that he was unable to appreciate the beauty of a flower?
当然不是。齐普夫是一位杰出的文学学者,他深刻领悟了书籍的美与力量,这朵文学天才之花。然而,齐普夫的与众不同之处在于,他并没有被这种美迷得神魂颠倒,而忽略了欣赏花朵的其他方式。其中一种方式恰好就是将花朵拆开。
Certainly not. Zipf was a prominent scholar of literature, someone who deeply grasped the beauty and power of the book, the flower of literary genius. What made Zipf different, though, was that he wasn’t so transfixed by this beauty as to be blind to the other ways in which a flower could be appreciated. One of those ways happened to involve taking the flower apart.
在齐普夫之前,一本书是逐行逐页地阅读、理解和思考的。你仿佛置身于盛放的玫瑰之中,感受着它的整体魅力。即使是汉利,他的索引也曾为齐普夫的阅读之旅提供助力,但他的著作也旨在为传统的阅读提供辅助。
Before Zipf, a book was something that was read, understood, and contemplated line by line and page by page. You took in the whole gestalt, like a rose in full bloom. Even Hanley, whose index had facilitated Zipf’s journey, intended his work as an aid for traditional reading.
但齐普夫这个奇特问题背后,却蕴含着一个关于书籍本质的全新概念。这个问题反映了他非凡的直觉:一种替代性的阅读方式是可能的:分析剥离花朵语境的文本小花瓣,寻找数学设计的证据。
But embedded in Zipf’s peculiar question was a radical new notion of what a book could be. The question reflected his marvelous intuition that an alternative form of reading was possible: analyzing little petals of text, stripped of their floral context, to look for evidence of mathematical design.
在过去的一个世纪里,科学家们一直在追寻这一开创性见解的踪迹。当我们完成对动词的分析时,我们自豪地认为自己是其中的一员。但事实上,我们当时仍然过于关注不规则动词的细节,以至于无法真正体会到齐普夫方法的威力。
For the last century, scientists have been following the trail of this pioneering insight. By the time we had finished our analysis of verbs, we were proud to count ourselves among their number. But in truth, we were still too caught up with the particulars of the irregular verbs to really appreciate the power of Zipf’s approach.
这种情况很快就会改变。毕竟,齐普夫仅仅通过摘取一捧花就揭示了令人惊叹的科学视野。现在,多亏了谷歌,整个图书馆正在一个接一个地被数字化。我们想尝试齐普夫的做法。但我们想要所有的花。
That would soon change. After all, Zipf had revealed breathtaking scientific horizons by picking apart a mere handful of flowers. Now, thanks to Google, entire libraries were being digitized, one after another. We wanted to try what Zipf had done. But we wanted all the flowers.
一位年轻的法国人在自己的祖国学习英语时,发现某些动词的过去式拼写不同。这些被错用的动词在教科书中被单独列成一节,甚至在不规则动词中也被单独列出。尽管学习所有这些动词非常费劲,但他还是坚持了下来,记住了那些过去式由加-t而不是-ed构成的单词表。
Studying English in his native land, a young Frenchman learnt that certain verbs were spelt differently in the past tense. These spoilt verbs dwelt in their own section of the textbook, singled out even among the irregulars. Although it was a real pain in the neck to learn them all, he soldiered through, memorizing the list of words whose past tense was formed by adding -t instead of -ed.
当他终于抵达美国时,这位学生对自己掌握的这门语言充满信心。但抵达后不久,在阅读媒体上关于伦敦奥运会的报道时,他惊讶地看到《华盛顿邮报》的标题是:“Burned-Out Phelps Fizzles in the Water against Lochte”(精疲力竭的菲尔普斯在与罗切特的水中较量中失利)。正如每个法国人所学到的,动词 burn 是不规则的。迈克尔·菲尔普斯应该感到“burnt out”才对。这些美国报纸难道没有文字编辑吗?
When he finally entered the United States, the student was brimming with confidence in his mastery of the language. But shortly after his arrival, while reading about the London Olympics in the press, he was surprised to see the following headline in the Washington Post: “Burned-Out Phelps Fizzles in the Water against Lochte.” As every Frenchman is taught, the verb burn is irregular. Michael Phelps should have felt burnt out. Didn’t these American papers have copy editors?
几天后,他又在《洛杉矶时报》上看到另一个令人痛心的标题:“Kobe Bryant Says He Learned a Lot from Phil Jackson”(科比·布莱恩特说他从菲尔·杰克逊那里学到了很多)。这名学生对菲尔·杰克逊一无所知,但科比用的竟然是“learned”,这仍然让他感到震惊。如果真要说的话,应该是“learnt”才对。
A few days later, he saw another distressing headline, this one in the Los Angeles Times: “Kobe Bryant Says He Learned a Lot from Phil Jackson.” The student knew nothing about Phil Jackson, but was still shocked that Kobe had learned from Phil. If anything, he should have learnt.
渐渐地,这位学生意识到,在这条规则上,所有美国人都犯了同样的错误。他知道大多数美国人说法语听起来很滑稽,但从他的教科书来看,他们的母语也同样糟糕。他觉得事情有点不对劲。
Little by little, the student realized that when it came to this particular rule, all Americans were making the same mistakes. He knew that most Americans sounded ludicrous when they spoke French, but to judge from his textbooks, they were equally bad at their native tongue. He smelt a rat.
幸运的是,他可以使用一种新型的观测镜。它很快揭开了真相:他当年在法国是在浪费时间。他觉得自己被“burnt”了。
Fortunately, he had access to a new kind of scope. It soon spilt the beans: He had been wasting his time back in France. He felt burnt.
发生了什么?因为动词burn/burnt、dwell/dwelt、learn/learnt、smell/smelt、spell/spelt、spill/spilt和spoil/spoilt都遵循类似的模式,它们在英语使用者的思维中相互支撑。因此,它们长期以来一直不规则——比你根据它们各自的频率预期的要长。
What happened? Because the verbs burn/burnt, dwell/dwelt, learn/learnt, smell/smelt, spell/spelt, spill/spilt, and spoil/spoilt all follow a similar pattern, they prop each other up in the minds of English speakers. As a result, they have been irregular for a very long time—longer than you would expect from their individual frequencies.
这些动词在许多教科书中仍然作为不规则动词出现。但实际上,曾经强大的联盟正在瓦解。其中两个成员,spoil和learn,在 1800 年就被规则化了。此后,又有四个动词被规则化:burn、smell、spell和spill。
These verbs still appear as irregular in many textbooks. But in reality, the once-mighty alliance is coming apart. Two members, spoil and learn, regularized by 1800. Four more have regularized since then: burn, smell, spell, and spill.
研究结果表明,这种趋势起源于美国。但后来蔓延到了英国,那里每年都有相当于英格兰剑桥市人口规模的人群改用 burned 代替 burnt。如今,只有 dwelt 仍然留在不规则动词之列。
The results suggest that this trend originated in the United States. But it has since spread to the United Kingdom, where each year, a population the size of Cambridge, England, adopts burned in lieu of burnt. Today, only dwelt still dwells among the irregulars.
总而言之:这名学生不该觉得自己被英语课程“burnt”了。他应该觉得自己被“burned”了。
In conclusion: The student was wrong to feel burnt by his English language courses. He should have felt burned.
空谈词典学家
ARMCHAIR LEXICOGRAPHEROLOGISTS
到 2007 年,我们与不规则动词的接触使我们相信,统计单词数量可以追踪某些类型的文化随时间的变化。但追踪不规则动词很容易,因为它们非常常见。例如,单词 went 大约每五千个单词(约每二十页)出现一次。在你读的每本书中,你都会反复看到它。但当我们超越不规则动词,试图更广泛地追踪单词时,很快就会触及齐普夫定律的阴暗面。频繁出现的单词(如 went)数量非常少。绝大多数单词都极其罕见。
By 2007, our encounter with irregular verbs had convinced us that counting words made it possible to track certain kinds of cultural change over time. But tracking irregular verbs is easy, because they are so frequent. The word went, for instance, appears about once every five thousand words, or roughly every twenty pages. You see it repeatedly in every book you read. But as one ventures beyond the irregular verbs, trying to track words more generally, one soon runs into the dark side of Zipf’s law. The words that are frequent (like went) are very few in number. The vast majority of words are exceedingly rare.
假设我们试图追踪一些更具挑战性的东西,比如被称为大脚怪(Sasquatch)的可憎雪人。难以捉摸的大脚怪大约每一千万个单词,或者说大约每一百本书,才会在英语文本中出现一次。追踪大脚怪比追踪典型的不规则动词要困难得多。
Suppose we were trying to track something a bit more challenging, like the abominable snowman known as the Sasquatch. The elusive Sasquatch appears in English texts approximately once in every ten million words, or roughly once every hundred books. Tracking down the Sasquatch is much, much harder than tracking the typical irregular verb.
不过,就文化概念而言,大脚怪并不难找。尼斯湖水怪则更难捉摸——每两百本书才出现一次。但如果你想真正考验自己追踪神秘生物词汇的勇气,不妨试试找一个卓柏卡布拉。这种吸血生物于1995年在波多黎各首次被发现。人们对其了解不多。但我们可以告诉你:卓柏卡布拉比大脚怪稀有得多。每1.5亿个单词(约1500本书)才有可能看到一次卓柏卡布拉。一个博览群书的人,一生中可能只会见到一次卓柏卡布拉。这就是最后一次:卓柏卡布拉。珍惜这一刻。
Still, as cultural concepts go, the Sasquatch isn’t very hard to find. The Loch Ness monster is more elusive—only one appearance every two hundred books. But if you want to really test your mettle as a lexical tracker of cryptic creatures, try finding a Chupacabra. The blood-drinking creature was first spotted in 1995 in Puerto Rico. Not much more is known. But we can tell you this: A Chupacabra is much rarer than a Sasquatch. There’s a sighting just once in every 150 million words, or about 1,500 books. An extremely well-read person might see a Chupacabra once in his or her entire life. Here it is, one last time: Chupacabra. Savor this moment.
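The conversions between words, pages, and books in the last few paragraphs can be checked with a little arithmetic. The sketch below assumes the round figures the text itself implies: about 100,000 words per book (ten million words is roughly a hundred books) and about 250 words per page (five thousand words is roughly twenty pages). The function names are hypothetical, chosen only for this illustration.

```python
# Round figures implied by the text; real books vary widely.
WORDS_PER_BOOK = 100_000   # "ten million words, or roughly ... hundred books"
WORDS_PER_PAGE = 250       # "five thousand words, or roughly every twenty pages"

# Average number of words of text between two appearances of each word.
words_between_sightings = {
    "went": 5_000,
    "Sasquatch": 10_000_000,
    "Chupacabra": 150_000_000,
}

def pages_between_sightings(words):
    """Convert a gap measured in words into a gap measured in pages."""
    return words / WORDS_PER_PAGE

def books_between_sightings(words):
    """Convert a gap measured in words into a gap measured in books."""
    return words / WORDS_PER_BOOK

def expected_sightings(words, n_books):
    """Expected number of appearances in a library of n_books volumes."""
    return n_books * WORDS_PER_BOOK / words

for word, gap in words_between_sightings.items():
    print(f"{word}: one sighting every {books_between_sightings(gap):g} books")
```

Under these assumptions, a library of a few thousand books would be expected to contain only a handful of Chupacabra sightings, which is why tracking rare words calls for millions of volumes.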
要追踪这样的词汇,我们需要数百万本书籍可供使用:大数据。而我们只能去一个地方获取它。
To track words like that one, we’d need millions of books at our disposal: big data. And there was only one place we could go to get it.
2002年,谷歌发展如日中天,联合创始人拉里·佩奇也有一些空闲时间。他想做什么呢?毕竟,谷歌的使命是“整合全球信息”,而佩奇知道书籍中蕴藏着丰富的信息。
In 2002, Google was going great guns, and cofounder Larry Page had some free time. What to do? Google’s mission is, after all, to “organize the world’s information,” and Page knew that there’s a lot of information in books.
他开始思考:将一个实体图书馆改造成一个可以在网络空间生存的数字图书馆有多难?没有人知道。因此,佩奇和玛丽莎·梅耶尔(时任谷歌产品经理;2013 年起担任雅虎首席执行官)决定做一个实验,用节拍器帮助他们保持翻阅一本三百页书的速度。实验花了四十分钟。按照这个速度,翻阅一个藏书量达七百万册的图书馆(比如佩奇的母校密歇根大学图书馆)大约需要五百年。当然,密歇根大学收藏的书籍只是所有书籍的一小部分。翻阅世界上所有的书籍(你需要将每一页都扫描成机器可读的形式)将需要几千年,甚至数万年。这似乎是不可能的。
He began to wonder: How hard would it be to transform a physical, brick-and-mortar library into a digital one that could live in cyberspace? No one knew. So Page and Marissa Mayer (then a product manager at Google; as of 2013, the CEO of Yahoo!) decided to do an experiment, using a metronome to help them keep the pace as they turned the pages of a three-hundred-page book. It took forty minutes. At that rate, just flipping through the pages of a seven-million-volume library, such as that of Page’s alma mater, the University of Michigan, would take about five hundred years. And of course, the University of Michigan has only a fraction of all books. Flipping through the pages of all the world’s books—as you would need to do in order to digitally scan each page into a machine-readable form—would take millennia, even eons. It seemed impossible.
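The five-hundred-year figure follows directly from the metronome experiment. The sketch below reproduces it under two simplifying assumptions that are ours, not the text's: every volume is treated as a three-hundred-page book that takes forty minutes, and the page turning goes on around the clock without breaks.

```python
MINUTES_PER_YEAR = 60 * 24 * 365.25

def scan_years(n_volumes, minutes_per_volume):
    """Years of nonstop page turning for one person to get through a library."""
    return n_volumes * minutes_per_volume / MINUTES_PER_YEAR

# Figures from the text: forty minutes per book, seven million volumes.
years = scan_years(n_volumes=7_000_000, minutes_per_volume=40)
print(f"about {years:.0f} years of nonstop page turning")
```

The result is a little over five hundred years for a single person, consistent with the estimate in the text.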
但当然,你并没有像一个29岁的亿万富翁那样思考。对于一位其公司即将跻身《财富》500强的互联网热潮巨头来说,一个“人·万古”(person-eon)是可以买到的商品。
But of course, you’re not thinking like a twenty-nine-year-old billionaire. To a giant of the Internet boom whose company would soon enter the Fortune 500, a person-eon is a commodity that you can buy.
因此,当密歇根大学校长玛丽·苏·科尔曼告诉佩奇,将大学书籍完全数字化需要一千年时,佩奇提供了谷歌的服务,并表示这项任务可以在六年内完成。
So when the University of Michigan’s president, Mary Sue Coleman, told Page that completely digitizing the university’s books would take a thousand years, Page offered Google’s services and suggested that the task could be completed in six.
于是,谷歌启动了一个项目,将有史以来的每一本书都数字化,将所有书籍汇集到一个图书馆,并将其加载到计算机硬盘上。
And with that, Google began a project to digitize every single book ever written—to assemble a library of everything, and load it onto a computer hard drive.
在谷歌着手采购和扫描所有书籍之前,该公司需要一份购物清单,以便追踪需要采购的书籍以及已经扫描的书籍。因此,谷歌从数百家图书馆和公司收集了图书目录信息,然后将这些目录合并,创建了一份清单,其中包含谷歌所能提供的每本有史以来出版的书籍。(或者更准确地说,是每本流传至今的书籍。例如,亚历山大图书馆被烧毁时丢失的书籍不计入总数。)最终的购物清单包含1.3亿册书。
Before Google could go about acquiring and scanning all the books, the company needed a shopping list to help keep track of which books it needed to get and which it had already scanned. So Google collected book catalog information from hundreds of libraries and companies, and then merged these catalogs to create a list containing, as best Google could tell, an entry for every book ever written. (Or, more precisely, for every book that has survived into the present day. The books lost when the Library of Alexandria burned down, for instance, don’t count in this total.) The resulting shopping list contained 130 million books.
接下来,谷歌需要获取并扫描每本书。在某些情况下,出版商会直接从印刷厂寄送副本。在这种情况下,谷歌会“破坏性地”扫描书籍:员工会剪掉装订线,然后以极高的速度逐页扫描,将图像存储为可在计算机上轻松查看的数字格式。对于其余的书籍,该公司联系了世界各地的图书馆,一次借出整排书架、整个分区、整座侧楼,甚至整栋建筑的藏书。与所有图书馆书籍一样,这些书需要归还,即使是谷歌也负担不起所有这些逾期罚款。因此,谷歌还开发了一种无损扫描系统:一小队翻页员追随佩奇和梅耶尔的脚步,受雇整天翻动书页,同时相机不停地拍摄文本的图像。在过去的十年里,这支势不可挡的扫描队伍已经翻动了数十亿次书页。偶尔,某张图像中会出现一根泄露真相的拇指。
Next, it needed to acquire and scan each book. In some cases, publishers sent copies straight from the presses. In this situation, Google would scan the books “destructively”: Employees would cut off the binding and scan the pages in, one after another, at very high speed, storing the images in a digital format that could easily be viewed on a computer. For the rest of the books, the company reached out to libraries around the world, checking out shelves, sections, wings, and even whole buildings at a time. Like all library books, the volumes needed to be returned—even Google couldn’t hope to afford all those late-book fees. So Google developed a nondestructive scanning system, too: A small army of page turners, following in Page and Mayer’s footsteps, was hired to turn pages all day long while cameras snapped images of the text. In the last decade, this unstoppable scanning squadron has turned the page billions of times. Every once in a while, a telltale thumb appears in one of the images.
最后,通过一种名为光学字符识别(OCR)的程序,计算机程序会查找并识别图像中包含的字母,将数字化图像转换为原始文本。最终得到的是一个包含整本书内容的文本文件,类似于你在文字处理器中输入的内容。
Finally, using a process called optical character recognition, in which a computer program finds and identifies the letters contained in an image, the digitized images are transformed into raw text. The result is a text file—akin to what you might produce when typing in a word processor—that contains the entire book.
谷歌的数字化努力取得了非凡的成功,这无疑是这位29岁亿万富翁逻辑的重大胜利。在佩奇与梅耶尔合作十年之后,以及在他公开宣布该项目九年后,谷歌已经将超过三千万本图书数字化。
In a major triumph for twenty-nine-year-old billionaire logic, Google’s digitization efforts have been extraordinarily successful. Ten years after Page flipped pages with Mayer and nine years after he publicly announced the project, Google has digitized more than thirty million books.
如此庞大的文本集合只有计算机才能分析。如果人类试图以每分钟两百字的合理速度阅读,不吃饭不睡觉,那么需要两万年才能读完。
Such a vast collection of text can only be analyzed by computer. If a human tried to read it, at the reasonable pace of two hundred words per minute, without interruptions for food or sleep, it would take twenty thousand years to finish.
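It can be instructive to run the twenty-thousand-year figure backward and see what it implies about the size of the collection. The sketch below uses only the numbers given in the text (two hundred words per minute, twenty thousand years, thirty million books); the implied words-per-book average is our own derived estimate, not a figure from the source.

```python
MINUTES_PER_YEAR = 60 * 24 * 365.25

WORDS_PER_MINUTE = 200        # reading pace given in the text
YEARS_OF_READING = 20_000     # nonstop reading time given in the text
BOOKS_DIGITIZED = 30_000_000  # size of the collection given in the text

# Total words a nonstop reader would get through in that time.
total_words = WORDS_PER_MINUTE * MINUTES_PER_YEAR * YEARS_OF_READING

# Average book length implied by the figures above (a derived estimate).
words_per_book = total_words / BOOKS_DIGITIZED

print(f"{total_words:.2e} words in total")
print(f"{words_per_book:,.0f} words per book on average")
```

The figures are mutually consistent if the collection holds roughly two trillion words, or about seventy thousand words per book on average.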
可以将这些数据视为对整个图书记录的民意调查。为了了解这项调查的全面性,我们可以想象一下,美国登记选民的数量(1.37亿)大约相当于有史以来出版的书籍总数(1.3亿)。盖洛普民意调查在2012年总统大选前五天发布,调查了2700名潜在选民,约占五万分之一。谷歌的图书调查涵盖了3000万本图书,约占四分之一。就民意调查而言,它的覆盖面极其广泛:对人类文化记录进行了前所未有的概述。
One way to think of this data is as a poll of the entire book record. To get a sense of how comprehensive this poll is, consider that there are about as many registered voters in the United States (137 million) as the total number of books ever published (130 million). The Gallup poll released five days ahead of the 2012 presidential election surveyed 2,700 likely voters, about 1 in 50,000. Google’s poll of all books includes 30 million books, or about 1 in 4. As polls go, it is incredibly comprehensive: an unprecedented précis of humanity’s cultural record.
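The two sampling rates in this comparison are simple ratios, and a short calculation confirms the rounding. The variable and function names below are hypothetical; the numbers are the ones quoted in the text.

```python
def one_in(population, sample):
    """Express a sampling rate as 'one in N': how many members per sampled one."""
    return population / sample

registered_voters = 137_000_000
gallup_sample = 2_700
books_ever_published = 130_000_000
books_digitized = 30_000_000

voter_rate = one_in(registered_voters, gallup_sample)
book_rate = one_in(books_ever_published, books_digitized)

print(f"Gallup poll: about 1 in {voter_rate:,.0f} voters")
print(f"Google Books: about 1 in {book_rate:.1f} books")
```

The first ratio comes out slightly above fifty thousand and the second slightly above four, so the book "poll" samples its population more than ten thousand times as densely as the election poll does.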
因为我们负担不起自己的人力,所以显然我们需要加入谷歌的行动。但是该怎么做呢?
Because we could not afford our own person-eon, it was clear that we needed to get in on the action at Google. But how?
2007年,机会来了:埃雷兹的妻子阿维娃·艾登(Aviva Aiden)受邀前往谷歌总部 Googleplex,领取一项面向计算机科学领域女性的奖项。埃雷兹也跟着去了,并设法来到了谷歌著名的研究总监彼得·诺维格(Peter Norvig)的办公室。
Opportunity knocked when, in 2007, Aviva Aiden, Erez’s wife, was invited to the Googleplex—Google’s headquarters—to receive an award for women in computer science. Erez tagged along and made his way to the office of Peter Norvig, Google’s famed director of research.
诺维格是人工智能领域的先驱。他撰写了该领域的标准教科书。他的演讲总是引人关注,许多人都会认真聆听。例如,2011年秋季,诺维格和塞巴斯蒂安·特伦开设了世界上第一个大规模开放在线课程(MOOC)。这门由斯坦福大学赞助的人工智能课程取得了巨大成功:超过16万名学生注册。它引发了高等教育的一场革命。
Norvig is a pioneer of artificial intelligence. He wrote the standard textbook on the subject. And when he talks, people listen. Many, many people listen. For instance, in the fall of 2011, Norvig and Sebastian Thrun taught the world’s first massive open online course, or MOOC. Presented under the auspices of Stanford University, their artificial intelligence course was a runaway success: More than 160,000 students enrolled. It set off a revolution in higher education.
这让他的会议方式令人意外。诺维格不爱多言。事实上,唯一比谷歌的电子书更难读懂的,就是诺维格听你说话时那副令人费解的扑克脸。最后,过了一会儿,他通常会说出一些要么非常有见地的话,要么完全不着边际的废话。这样,你就能知道自己的论点是否站得住脚了。
That makes his approach to meetings surprising. Norvig does not like to say much. In fact, the only thing harder to read than Google’s digital books collection is Norvig’s impenetrable poker face as he listens to you talk. Finally, after some time, he typically says something that is either very insightful or a complete non sequitur. With that, you know if you’ve succeeded in making your case.
在听完埃雷兹长达一小时的演讲后,诺维格终于亮出了自己的底牌。
After listening to Erez present our hour-long pitch, Norvig finally showed his cards.
“这听起来很棒,但我们怎样才能在不侵犯版权的情况下做到这一点呢?”
“This all sounds great, but how can we do it without violating copyright?”
2004年,谷歌公开宣布计划将全球所有图书数字化,出版业因此感到紧张,这在情理之中。如果他们的图书可以在网络上搜索到,这对出版业意味着什么?谷歌打算与公众分享哪些内容?即使谷歌愿意遵守版权法,该公司又该如何确定每本书的版权归属?谷歌会不会像苹果iTunes颠覆音乐行业一样,彻底颠覆整个行业?
When Google publicly announced its intention to digitize all the world’s books in 2004, the publishing industry became—understandably—nervous. What would it mean for them if their books were to become searchable on the Web? Which content did Google intend to share with the public? Even if Google wanted to obey copyright law, how could the company figure out who held the rights to any given book? Would Google just overthrow the whole industry, as Apple’s iTunes had done with music?
很快,诉讼开始纷至沓来。2005 年 9 月 20 日,代表大量个人作家的美国作家协会提起集体诉讼。10 月 19 日,代表大型出版商麦格劳希尔、企鹅美国、西蒙与舒斯特、培生教育和约翰威利的美国出版商协会也提起了诉讼。这两起诉讼都指控谷歌“大规模侵犯版权”。2006 年,法国和德国出版商也加入了战局。到 2007 年 3 月,谷歌的竞争对手也纷纷加入进来。微软的顶级律师之一托马斯·鲁宾发表了一系列事先准备好的讲话,严厉批评谷歌的数字化努力,称谷歌的做法“系统性地侵犯版权”并“破坏了创作的关键动力”。谷歌图书项目迅速成为大数据历史上最重要的法律热点之一。
Soon, lawsuits began to pour in. On September 20, 2005, the Authors Guild, representing a huge number of individual authors, filed a class-action lawsuit. On October 19, the Association of American Publishers, representing megapublishers McGraw-Hill, Penguin USA, Simon & Schuster, Pearson Education, and John Wiley, filed its own lawsuit. Both suits alleged “massive copyright infringement.” In 2006, French and German publishers joined the fray. By March 2007, Google’s competitors were piling on, too. Thomas Rubin, one of the top attorneys at Microsoft, delivered a set of prepared remarks blasting Google’s effort at digitization, saying that Google’s approach “systematically violates copyright” and “undermines critical incentives to create.” The Google Books project was rapidly becoming one of the most important legal flashpoints in the history of big data.
谷歌图书的麻烦预示着大数据研究未来将面临的法律挑战。最有趣的大数据集往往掌握在大型企业手中,也就是世界上的谷歌、Facebook、亚马逊和推特们。掌握在它们手中,但不一定归它们所有。数据通常来自个人,无论是因为他们写了一本书、创建了一个网页,还是拍了一张照片。这些人对数据保留着重大权利,他们理应如此,因为这是他们的创作。这些权利可以采取版权、隐私权、知识产权或其他一系列权利的形式。因此,数据不是公开的,但也不是私有的。相反,它构成了一个共享的数字公地,一个无人区,数百万人可能在其中拥有利益,没有任何实体拥有完全的权威,而且法律地位通常模糊不清。
Google Books’ troubles are a harbinger of the legal challenges that big data research will face going forward. The most interesting big datasets are frequently in the hands of massive corporations—the Googles, Facebooks, Amazons, and Twitters of the world. In the hands of, but not necessarily owned by. The data typically comes from individual people, whether it’s because they wrote a book, put up a Web page, or took a picture. Those people retain significant rights over the data—as well they should, since it is their creation. These rights can take the form of copyrights, or privacy rights, or intellectual property rights, or a litany of other rights. So the data isn’t public, but it isn’t private, either. Instead, it comprises a shared digital commons, a no-man’s-land in which millions of people may have an interest, no entity has complete authority, and legal status is often obscure.
对科学家来说,这无疑是一次颠覆性的变革。我们已经习惯了这样一个世界:生成或获取数据,然后随心所欲地进行分析。科学家最多只需要获得伦理委员会的批准。但传统的研究方法,会使我们在引言中提到的每一项大数据研究——从莱文对eBay的分析到巴拉巴西对手机移动轨迹的研究——都变得非法且不道德。在大数据的世界里,获取所有信息然后再进行分析,无论在实践上还是道德上都是不可能的。如果没有人愿意——甚至没有权利——交出大数据,我们该如何利用它呢?
For scientists, this is a game changer. We have gotten used to a world in which we generate or obtain data and then analyze it however we want. At most, a scientist might need to get approval from an ethics panel. But the traditional approach would make each of the big data studies we mentioned in our introduction—from Levin’s analysis of eBay to Barabási’s study of cell phone movements—illegal and unethical. In the world of big data, the notion of getting everything and analyzing it later is a practical and moral impossibility. How can we take advantage of big data, if no one is willing—or even has the right—to hand it over?
诺维格的问题直指关键问题。
Norvig’s question had zeroed in on the crucial issue.
要求谷歌直接提供世界书籍的全文,这根本行不通。幸运的是,我们不需要这么做。
Asking Google to just hand us the full text of the world’s books was going to be a nonstarter. Fortunately, we didn’t need to.
这是因为大数据会投下巨大的影子。正如影子是真实物体的暗色投影(一种视觉变换,保留了原始物体的某些方面,同时滤除了另一些方面),影子数据保留了部分原始信息,但并非全部。虽然制作影子更像是一门艺术而非科学,但它对于在大数据工作中取得进展至关重要。错误的影子可能在伦理上存在争议,在法律上难以处理,在科学上也毫无用处。但如果你选择了恰到好处的角度,就有可能遮蔽原始数据集中法律和伦理上的敏感部分,同时保留其大部分非凡的威力。
That’s because big data casts big shadows. Just as a shadow is the dark projection of a real object—a visual transformation that preserves some aspects of the original object while filtering out others—shadow data preserves some, but not all, of the original information. Though shadowing is more art than science, it’s crucial to making progress when working on big data. The wrong shadow can be ethically dubious, legally intractable, and scientifically useless. But if you choose exactly the right angle, it’s possible to obscure the legally and ethically sensitive parts of the original dataset while retaining much of its extraordinary power.
如果你非常幸运,隐藏数据集可能很容易。例如,大数据集的问题通常在于它会泄露敏感的个人信息。如果是这样,删除与每条记录相关的人名似乎就足够了。但这很少如此简单。问题是,许多大数据集的信息量如此之大,以至于仔细检查后发现,为每条记录附加一个名字是多余的:记录本身包含如此多的识别特征,以至于地球上只有一个人可以描述它。在这种情况下,删除名字并没有多大用处。
If you’re very lucky, shadowing a dataset can be easy. For instance, often the problem with a big dataset is that it exposes sensitive personal information. If so, erasing the name of the person associated with each record seems like it ought to be enough. But it’s rarely so simple. Trouble is, many big datasets are so information-rich that attaching a name to each record is, on closer examination, redundant: The record itself contains so many identifying characteristics that there’s only one person on the planet it could describe. In such a case, removing the name doesn’t accomplish much.
2006年,美国在线(AOL)就曾为此付出惨痛代价。当时,它本应为科学研究做出巨大的贡献,却公开发布了超过65万用户的搜索日志。当然,AOL对这些日志进行了编辑:日志中没有包含用户姓名,每个用户的用户名都被替换成了一个不起眼的数值。AOL以为这样可以保护用户的隐私。但AOL却大错特错。
America Online learned this the hard way in 2006, when, in what was meant to be a magnanimous contribution to scientific research, it publicly released the search logs of more than 650,000 users. Of course, AOL redacted the logs: People’s names were not included in the release, and each user’s handle was replaced with a nondescript numerical value. AOL thought this would protect users’ privacy. But AOL was badly mistaken.
通过检查现已公开的搜索日志,并将其与其他广泛可用的数据进行交叉比对,《纽约时报》记者迈克尔·巴巴罗(Michael Barbaro)和小汤姆·泽勒(Tom Zeller, Jr.)等人能够推断出用户身份。数据公布几天后,巴巴罗和泽勒注意到,在跨越三个月的数百条其他查询中,用户 4417749 搜索过“佐治亚州利尔伯恩的园林绿化商”(landscapers in Lilburn, GA),还搜索过许多姓“Arnold”的人。快速翻阅电话簿后发现,该用户很可能是一位住在利尔伯恩的62岁女士,名叫塞尔玛·阿诺德(Thelma Arnold)。当巴巴罗和泽勒联系上阿诺德女士,并向她朗读了她自己搜索日志中的一些查询内容时,她对美国在线的所作所为感到震惊:“我们都有隐私权。不应该有人查出这一切。”
By examining the now-public search logs and cross-referencing them with other widely available data, it was possible for people like New York Times journalists Michael Barbaro and Tom Zeller, Jr., to deduce user identities. Days after the data was released, Barbaro and Zeller noticed that, amid hundreds of other queries spanning a three-month period, user 4417749 searched for “landscapers in Lilburn, GA” and for many people whose last name was “Arnold.” A quick look through the phone book suggested that the user was probably a sixty-two-year-old lady living in Lilburn named Thelma Arnold. When Barbaro and Zeller contacted Ms. Arnold and read her some of the queries from her own search log, she was flabbergasted at what AOL had done: “We all have a right to privacy. Nobody should have found this all out.”
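A toy example makes it clearer why replacing names with numbers did not protect anyone. In the sketch below, every user id, query, name, and town is invented for illustration, and the matching rule (a town and a surname both appearing somewhere in one user's queries) is a deliberately crude stand-in for the kind of cross-referencing the journalists did by hand.

```python
# Pseudonymized search log: no names, only numeric user ids (invented data).
search_log = [
    (1001, "landscapers in springfield"),
    (1001, "homes sold in springfield"),
    (1001, "rivera family reunion"),
    (1002, "cheap flights"),
    (1002, "weather springfield"),
]

# Hypothetical public directory, standing in for a phone book.
directory = [
    {"name": "Dana Rivera", "town": "springfield", "surname": "rivera"},
    {"name": "Lee Rivera", "town": "shelbyville", "surname": "rivera"},
    {"name": "Sam Okafor", "town": "springfield", "surname": "okafor"},
]

def candidates(user_id):
    """Directory entries whose town and surname both appear in the user's queries."""
    text = " ".join(query for uid, query in search_log if uid == user_id)
    return [p["name"] for p in directory
            if p["town"] in text and p["surname"] in text]

print(candidates(1001))  # ['Dana Rivera']
print(candidates(1002))  # []
```

User 1001 is narrowed to a single person even though the log never contains a name, because the combination of queries is itself identifying.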
AOL 意识到了自己的错误,并试图纠正问题。在数据发布仅三天后,该公司就将其下线。该公司还道了歉,解雇了发布日志的研究人员及其上司。几周后,AOL 的首席技术官辞职。但为时已晚:数据已在网络上传播开来。由于 AOL 为促进研究所做的努力看似高尚,但执行不力,AOL 遭受了应得的负面宣传和集体诉讼。这次惨败成为大数据匿名化难度的经典案例——对于业内人士来说,这也是一个警示故事,警示公司在利他性数据共享时可能面临的风险。AOL 发布这些日志几乎没有获得任何好处,最终却付出了巨大的代价。诺维格也对此耿耿于怀。
AOL realized its mistake and tried to rectify the problem. Only three days after releasing the data, the company took it offline. It also apologized, fired the researcher who released the logs, and fired the researcher’s supervisor. A few weeks later, AOL’s CTO resigned. But it was too late: The data had already spread across the Web. Because of its high-minded but poorly executed effort to catalyze research, AOL was hit with a wave of well-deserved negative publicity and a class-action lawsuit. The debacle became a classic example of how hard it is to anonymize big data—and, to those in the industry, a cautionary tale of the dangers a company can face when it wades into altruistic data sharing. AOL stood to gain almost nothing by releasing those logs, and in the end it paid a great price. Norvig had this, too, in the back of his mind.
当然,姓名并非唯一可能损害数据集的因素。谷歌图书的问题恰恰相反。书中少数几个通常可以公开而不必担心诉讼的部分之一就是作者姓名。其余部分则受版权保护。
Of course, names are not the only thing that can make a dataset compromising. Google Books has the opposite problem. One of the few pieces of a book’s text that you can usually release without fear of a lawsuit is the name of its author. The rest of the book’s text is protected by copyright.
大影子如何帮助我们突破这一困境?要利用大数据,我们需要找到一个满足四个重要标准的影子。首先,这个影子需要保护数百万人的权利,正是他们的共同努力创造了原始数据集。其次,它需要有趣。第三,它不能与作为数据守门人的公司的宗旨背道而驰。第四,它必须是人们在实践中能够真正生成的东西。美国在线的问题不在于它发布了有关用户搜索的数据;问题在于它选择的影子遮蔽的信息太少,导致严重违反了我们的第一个标准。杰里米·金斯伯格(Jeremy Ginsberg)创建谷歌流感趋势时,也发布了源自用户搜索的信息。但他的影子以一种不伤害任何人的方式汇总了这些数据,唯一受害的是流感病毒。
How can big shadows help us navigate this impasse? To make use of big data, one needs to find a shadow that satisfies four important criteria. First, the shadow needs to protect the rights of the millions of people whose collective efforts created the original dataset. Second, it needs to be interesting. Third, it cannot run counter to the purposes of the company, which serves as the data’s gatekeeper. Fourth, it needs to be something one can actually generate in practice. AOL’s problem was not that it had released data about user searches; the problem was that the shadow it chose obscured far too little, and led to an egregious violation of our first criterion. When Jeremy Ginsberg created Google Flu Trends, he too released information derived from user searches. But his shadow aggregated the data in such a way that no one was harmed—except for the influenza virus.
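The contrast between a shadow that obscures too little and one that aggregates can be sketched in code. The example below is not the method behind Google Flu Trends; it is a minimal illustration, with an invented log and a hypothetical `min_count` threshold, of releasing only weekly counts of matching queries instead of per-user records.

```python
from collections import Counter

def weekly_shadow(log, keywords, min_count=2):
    """Count matching queries per week, dropping user ids and the queries themselves.

    Weeks with fewer than min_count matches are suppressed, so that a single
    person's searches cannot be read off the released counts.
    """
    counts = Counter()
    for user_id, week, query in log:
        if any(keyword in query.lower() for keyword in keywords):
            counts[week] += 1
    return {week: n for week, n in sorted(counts.items()) if n >= min_count}

# Invented log of (user id, week, query) records.
log = [
    (1, "2008-W01", "flu symptoms"),
    (2, "2008-W01", "Flu remedies"),
    (3, "2008-W01", "landscapers near me"),
    (1, "2008-W02", "fever and flu"),
    (4, "2008-W02", "movie times"),
]

print(weekly_shadow(log, keywords=["flu"]))  # {'2008-W01': 2}
```

The released dictionary contains no user ids and no query text, only totals, which is what lets a shadow of this kind satisfy the first criterion while still being scientifically interesting.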
使用大型影子数据让我们能够在保护数据集信息的同时,确保其正常运作。受益的不仅仅是参与其中的研究人员。由于理想的影子数据在伦理和法律上都是无害的,因此通常可以说服谨慎的守护者将其公开。因此,大型影子数据让我们能够将高度保密的数据集转化为强大的公共资源,任何有奇思妙想的人,无论是科学家、人文学者、企业家还是高中生,都可以利用。当我们与企业交流时,我们喜欢将其描述为一种数据慈善的形式:捐赠比特数据与捐赠金钱数据一样好,而且从定义上来说,它更便宜。
Using big shadows gives us a way to protect the information in a dataset while still putting it to work. And it’s not just the researchers involved who stand to benefit. Because an ideal shadow is ethically and legally innocuous, it’s often possible to persuade its wary keepers to release it into the public domain. Thus, big shadows give us a way to transform highly guarded datasets into formidable public resources, usable by anyone with a bright idea, whether it’s a scientist, a humanist, an entrepreneur, or a high school student. When we’re talking to companies, we like to present this as a form of data philanthropy: Donating bits can be just as good as donating bucks, and it is, by definition, cheaper.
为简单起见,我们可以将 Google 图书的原始数据想象成一张长表,其中包含每本书的全文,以及书籍信息,例如书名、作者姓名和出生日期、原图书馆以及出版日期。Google 图书会投下哪些大影子?很多。但并非所有影子都同样有前景。
For simplicity, think of the raw data of Google Books as one long table containing the full text of each book, coupled with information about the work, such as the book’s title, the author’s name and date of birth, the library of origin, and the date of publication. What big shadows are cast by Google Books? Many. But not all are equally promising.
一个影子只包含每本书的书名。这个影子包含大约一亿个单词。与完整的藏书相比,这些数据非常小,不足以支撑太多新的科学研究。但这仍然存在很大问题:谷歌将这些书名视为商业智能,因为该公司不希望竞争对手知道哪些书名已经扫描,哪些没有。因此,这些书名并不能构成一个好的影子。
One shadow consists of only the title of each book. This shadow includes about one hundred million words. The data is tiny in comparison to the full collection, and too small to enable much new science. But it’s still quite problematic: Google considers these titles to be business intelligence, because the company doesn’t want competitors to know which books it has scanned and which it has not. So the titles don’t make for a good shadow.
另一个影子是所有公有领域书籍的全文——所有版权已过期的书籍。这是一个非常有趣的数据集,可能避免了存在版权所有者时所涉及的棘手问题。但它有两个缺点。首先,由于版权期限很长,1920 年之后出版的书籍很少属于公有领域。这意味着大数据迄今为止最大的时期——20 世纪和 21 世纪初——几乎完全没有体现。其次,过时的版权法常常使任何特定书籍的地位变得模糊不清。这种模糊性影响了 Google 馆藏中的大量图书。由于不清楚哪些书籍应该被收录,这个影子的计算难度可能出奇地大。
Another shadow is the full text of all public-domain books—all books on which the copyright has expired. This is a really interesting dataset, potentially free of the thorny issues involved when there are rights holders. But it has two drawbacks. First, since copyright extends for so long, few books published after 1920 are in the public domain. This means that the periods in which the big data is by far the biggest—the twentieth and early twenty-first centuries—are almost totally unrepresented. Second, the antiquated laws that govern copyright often leave the status of any particular book ambiguous. Such ambiguities affect a vast number of books in Google’s collection. Because it’s unclear which books should be included, this shadow can be surprisingly difficult to compute.
给 Norvig 什么建议?
What to suggest to Norvig?
我们回想起凯伦·莱默的《传奇的、词汇丰富的、饶舌的爱情》。如果这个故事是西方文明历史记载的重要组成部分,而作者几乎是每个人,那么翻阅莱默的这本书,通过词频揭示故事及其作者的隐秘心理,岂不是更有趣?
We thought back to Karen Reimer’s Legendary, Lexical, Loquacious Love. Wouldn’t the experience of leafing through Reimer’s book, the way that word frequency revealed the hidden psyche of the story and of its author, be more interesting if the story was a big chunk of the historical record of Western civilization, and if the author was, more or less, everyone?
我们越想,就越觉得她的字母小说中隐含着一种既简单又美丽的影子,美丽,美丽,美丽,美丽。为什么我们不直接在谷歌图书中公布这些词频呢?
The more we thought about it, the more her alphabetical novel seemed to hint at a shadow that was both simple and beautiful, beautiful, beautiful, beautiful, beautiful. Why didn’t we just expose the word frequencies in Google Books?
更准确地说,我们的想法是创建一个影子数据集,为出现在英文书籍中的每个单词和短语创建一条单独的记录。这些单词和短语(用计算机科学的术语来说,就是 n-gram)包括 3.14159(1-gram)、banana split(2-gram)和 the United States of America(5-gram)。对于每个单词和短语,记录将包含一长串数字,显示该特定 n-gram 在书籍中出现的频率,年复一年,回溯到五个世纪前。这不仅会非常有趣,而且在我们看来,它在法律上可能也无害。莱默从未因出版他人小说的字母顺序版本而被起诉。
More precisely, our idea was to create a shadow dataset containing a single record for every word and phrase that appeared in English books. These words and phrases—the fancy computer science term is n-gram—include 3.14159 (a 1-gram), banana split (a 2-gram), and the United States of America (a 5-gram). For each word and phrase, the record would consist of a long list of numbers, showing how frequently that particular n-gram appeared in books, year after year, going back five centuries. Not only would this be extremely interesting, it seemed to us that it would probably be legally innocuous. Reimer was never sued for publishing an alphabetical version of someone else’s novel.
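The record structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pipeline that was actually run on Google Books: the function names `ngrams` and `build_shadow`, the whitespace tokenization, and the two-book toy corpus are all hypothetical.

```python
from collections import Counter, defaultdict

def ngrams(tokens, n):
    """Yield every run of n consecutive tokens as one space-joined string."""
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def build_shadow(books, max_n=5):
    """Map each n-gram (n = 1..max_n) to a {year: count} table.

    `books` is an iterable of (year, text) pairs. Splitting on whitespace
    is a simplifying assumption; real tokenization is more involved.
    """
    shadow = defaultdict(Counter)
    for year, text in books:
        tokens = text.split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                shadow[gram][year] += 1
    return shadow

# Toy corpus, invented purely for illustration.
books = [
    (1900, "the United States of America"),
    (1950, "a banana split in the United States of America"),
]
shadow = build_shadow(books)
```

Each key of `shadow` plays the role of one record, and its per-year counter is the "long list of numbers" for that n-gram. Dividing each count by the total number of n-grams published that year would turn the counts into frequencies.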
但仍然存在一个危险:如果黑客知道如何利用公开的词频和短语频率数据重建所有书籍的全文怎么办?用微小、重叠的片段拼凑出海量文本并非明显不合理的策略。事实上,一种类似的方法是现代基因组测序的基础——科学家用来读取细胞内 DNA 的方法。
But there was still one danger: What if a hacker figured out how to use the public data on word and phrase frequencies to reconstruct the full text of all the books? Assembling a massive text from tiny, overlapping snippets is not an obviously unreasonable strategy. In fact, an analogous method is the basis of modern genome sequencing—the approach used by scientists to read the DNA inside a cell.
为了解决这个问题,我们依赖一个统计事实:在任何一本书里,你不用读多远就会碰到一个独一无二的表述。例如,前面的句子可能是“碰到一个独一无二的表述”(bump into a unique formulation)这个短语有史以来的唯一用法,或者至少在出现这句话之前是这样。因此,我们添加了一个简单的修复方案:我们的影子数据不会包含只出现过几次的单词和短语的频率数据。经过这一修改,从数学上来说,重建全文将是不可能的。
To solve this problem, we relied on a statistical fact: You don’t have to go far in any given book to bump into a unique formulation. For instance, the previous sentence was probably the only use ever of the phrase “bump into a unique formulation,” or at least, it was, until this sentence came along. So we added a simple fix: Our shadow would not include frequency data for words and phrases that had been written only a handful of times. With this modification, reconstructing the full texts would be mathematically impossible.
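The fix amounts to a filter on total counts. A minimal sketch follows, assuming the counts are stored as a mapping from n-gram to per-year counts. The name `drop_rare`, the `min_total` parameter, and the value 5 are hypothetical; the text says only "a handful of times".

```python
def drop_rare(counts_by_ngram, min_total):
    """Keep only n-grams whose count summed over all years is >= min_total."""
    return {
        gram: by_year
        for gram, by_year in counts_by_ngram.items()
        if sum(by_year.values()) >= min_total
    }

# Invented counts, for illustration only.
counts = {
    "the": {1900: 120, 1950: 340},
    "bump into a unique formulation": {2013: 2},
}
public = drop_rare(counts, min_total=5)
```

Because the rare phrases are exactly the ones that would let overlapping snippets be chained back into a particular book, removing them breaks the chain while leaving the common n-grams intact.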
由此产生的影子(ngrams)似乎极其前景光明。文本所享有的版权保护不会受到任何损害(标准1)。我们从对不规则动词的研究以及莱默的小说中了解到,仅仅通过追踪单个单词的频率就能获得多少洞见(标准2)。这将是一种强大的概念搜索新方法,因此对于一家以搜索为基础的公司来说,这是一个颇具吸引力的想法(标准3)。此外,计算单词可能是计算机科学中最简单的问题(标准4)。
The resulting shadow—the ngrams—seemed extremely promising. The copyright protection the texts enjoyed would not be compromised at all (criterion 1). We knew from our work on irregular verbs, and from Reimer’s novel, how much insight could be gleaned just by tracking the frequency of a single word (criterion 2). It would be a powerful new way to search for concepts, and thus an appealing notion for a company built on search (criterion 3). And counting words is possibly the simplest problem in computer science (criterion 4).
当然,如果我们仅限于 ngram 数据,这些词语几乎会被剥离所有上下文,因此我们无法判断撰写伊利亚·卡赞文章的人究竟是在说他是一位伟大的导演,还是说他在“红色恐慌”期间指名道姓背叛了朋友。但这并非缺陷,而是一个特性:正是上下文使得这些数据具有法律敏感性。摆脱了上下文,我们就可以有力地证明,我们的影子数据集及其支持的工具不仅可以与我们研究人员共享,还可以与全世界共享。我们的影子数据集恰到好处:这是在不违法的情况下你能获得的最大乐趣。
Of course, if we limited ourselves to ngram data, the words would be stripped of nearly all context, so we wouldn’t be able to tell if someone writing about Elia Kazan was arguing that he was a great director or that he betrayed his friends by naming names during the Red Scare. But that’s not a bug, it’s a feature: The context was exactly what had made the data legally sensitive. Freed of context, we could make a strong case that our shadow dataset, and the tools it powered, could be shared not only with us, the researchers, but with the entire world. Our shadow hit the spot: It’s the most fun you can have without breaking the law.
Ngrams 就是我们的答案。Norvig 思考了一会儿,觉得这个想法值得一试。他帮我们组建了一个团队:谷歌工程师 Jon Orwant 和 Matt Gray,以及我们的实习生 Yuan Shen。
Ngrams were our answer. Norvig thought about this idea for a minute and decided it might be worth a shot. He helped us assemble a team: Google engineers Jon Orwant and Matt Gray, and our intern, Yuan Shen.
We were in. Suddenly we had access to the biggest collection of words in history.
语言是由词语组成的。但是,词语是什么呢?
Language is assembled out of words. But what is a word?
这是一个严肃的问题。想想政客们。乔治·W·布什总统在其整个职业生涯中,偶尔会在语言上发挥创意,比如在 underestimated 一词前加上前缀 mis-。这些布什主义使他经常成为笑柄,并成为深夜电视节目的出气筒。政客们的语言受到严密监控,即使是像拼写不规范这样看似微不足道的小事,也可能成为烫手的“potatoe”。前副总统丹·奎尔在回忆录中将公开拼错 potato 的经历描述为“不仅仅是一次失言,而是可以想象到的最糟糕的决定性时刻。”然而,莎拉·佩林在推文中使用 refudiated 一词后遭到公众嘲笑,她指出,与其他政客一样,她也被施以双重标准。毕竟,她在推特上写道:“英语是一种活的语言。莎士比亚也喜欢创造新词。”
It’s a weighty issue. Consider politicians. Throughout his career, President George W. Bush occasionally got creative with language, doing things like adding the prefix mis- before the word underestimated. These Bushisms made him the frequent butt of jokes and a punching bag for late-night TV. The language used by politicians is so carefully monitored that even something as seemingly minor as nonstandard spelling can be a hot “potatoe.” In his memoirs, former vice president Dan Quayle described the experience of publicly misspelling potato as “more than a gaffe. It was a defining moment of the worst imaginable kind.” Yet Sarah Palin, faced with public ridicule after she used the word refudiated in a tweet, pointed out that, like other politicians, she was being held to a double standard. After all, she tweeted, “English is a living language. Shakespeare liked to coin new words too.”
她是对的。莎士比亚的戏剧充满了新词。事实上,像布什一样,莎士比亚在社会议题上是保守派,在前缀上却是自由派。他经常使用导致布什创造 misunderestimate 的相同策略来创造新词。但与布什不同的是,莎士比亚成功了,随着他创造的词汇被广泛采用,留下了巨大的词汇遗产。例如,他使用前缀 lack- 来创造 lack-beard、lack-brain、lack-love 和 lack-luster(这个词后来的命运可一点也不黯淡)等词。通常,诗人比政客享有更多的词汇自由。刘易斯·卡罗尔的诗“Jabberwocky”主要由卡罗尔自创的单词组成;看到其中有多少单词在今天被认为是正确的英语,他可能会咯咯地笑出声来。
And she’s right. Shakespeare’s plays are chock-full of neologisms. In fact, like Bush, Shakespeare was a social conservative and a prefix liberal. He often created new words using the same strategy that led Bush to create misunderestimate. But unlike Bush, Shakespeare got away with it, leaving a vast lexical legacy as his coinages became widely adopted. For instance, he used the prefix lack- to create words like lack-beard, lack-brain, lack-love, and lack-luster (a word whose subsequent career has been anything but). Poets in general enjoy more lexical leeway than politicians do. Lewis Carroll’s poem “Jabberwocky” is composed mostly of words that Carroll made up—and he’d probably chortle at how many of them are considered proper English today.
那么,我们怎样才能决定哪些词语可以使用,哪些词语会让我们成为深夜笑柄呢?
So how can we decide what words are okay to use, and which will transform us into a late-night punch line?
词典编纂者。词典的作者;无害的苦工……
Lexicographer. A writer of dictionaries; a harmless drudge . . .
—塞缪尔·约翰逊,《英语词典》,1755 年
—Samuel Johnson, A Dictionary of the English Language, 1755
词典至少在原则上解决了什么是词、什么不是词的问题。毕竟,词典是官方认可的词汇目录,每个词汇都配有相应的含义列表。许多词典旨在提供便捷的参考,例如《美国传统词典》第四版收录了约11.6万个词汇。有些词典的规模更大,其中最宏大的莫过于内容全面的、长达23卷的《牛津英语词典》。《牛津英语词典》于1928年首次完成,最新版本收录了44.6万个词汇。如果你想知道哪些词是该语言的正式组成部分,就去查词典。如果词典里有,那就是一个词。如果没有,那就不是。
Dictionaries, at least in principle, solve the problem of what is or is not a word. After all, dictionaries are catalogs of officially approved words, each paired with a list of approved meanings. Many dictionaries are meant to be handy references, like the American Heritage Dictionary, whose fourth edition lists about 116,000 words. Some dictionaries are more ambitious, none more so than the comprehensive, twenty-three-volume compendium known as the Oxford English Dictionary. First completed in 1928, the most recent edition of the OED lists 446,000 words. If you want to know what words are officially part of the language, dictionaries are the place to go. If it’s in the dictionary, it’s a word. If it isn’t, it’s not.
但即便如此,我们仍然面临一个难题:编纂词典的词典学家究竟是如何知道应该收录哪些词汇的呢?
But even if that’s the case, we still have a puzzle on our hands. How exactly do the lexicographers who write dictionaries know which words to include?
关于其如何运作有两种观点。
There are two ideas about how this works.
有一种理论认为,词典编纂者的工作是规范性的。根据这种观点,词典编纂者负责决定语言中包含的内容,并通过编写词典来规定我们应该使用什么词,不应该使用什么词。这就是泰迪·罗斯福对词典编纂的“公牛驼鹿”观点。1906 年,他命令政府印刷局开始使用一种大大简化的拼写:I have answered your grotesque telephone 变成了 I hav anserd yur grotesk telefone。这在国会并不受欢迎,原来的拼写也保留了下来。在今天的法国,规范性的词典编纂观点仍然占主导地位,政府时不时会发布关于正确用词和拼写的官方文件。2013 年 1 月,《官方公报》建议用 mot-dièse(大致意思是“词-井号”)代替 hashtag。当然,推特圈对此集体回应了一个 #ROFL。这种规范性方法的问题在于,并不清楚有谁在负责语言,或者应该由谁来负责。语言超越任何特定的政府、种族或国籍。
One theory is that the lexicographer’s job is prescriptive. According to this view, lexicographers are in charge of what is in the language, and in writing dictionaries, they legislate what words we should and should not use. This was Teddy Roosevelt’s “Bull Moose” view of lexicography. In 1906, he ordered the Government Printing Office to begin using a drastically simplified spelling: I have answered your grotesque telephone became I hav anserd yur grotesk telefone. This did not go down well with Congress, and the original spellings remained untouched. The prescriptive view of lexicography is still dominant today in France, where every now and then the government publishes an official document about correct word usage and spelling. In January 2013, the Journal Officiel recommended that hashtag be replaced with mot-dièse (roughly, word–pound-sign). Of course, the Twitterverse responded with a collective #ROFL. The problem with the prescriptive approach is that it’s not obvious that anyone is, or should be, in charge of language. A language transcends any particular government, ethnicity, or nationality.
另一种观点——如今更为广泛接受,尤其是在美国——认为词典编纂者的工作并非规定性——告诉我们该做什么——而是描述性——记录我们在自主决定时所做的事情。根据这种观点,词典编纂者并非君主,而是探险家。词典是他们所发现事物的地图。
A different idea—one that is more widely believed today, especially in the United States—is that the lexicographer’s job is not prescriptive—telling us what to do—but instead descriptive—reporting what we do when left to our own devices. According to this approach, lexicographers are not monarchs but explorers. A dictionary is a map of what they have found.
但这个想法也存在一个问题。如果词典编纂者不能凭命令来定义一个词,那么他们是否也有可能犯错呢?我们究竟能有多依赖词典?
But there’s a problem with this idea, too. If lexicographers can’t decide what a word is by fiat, then isn’t it possible for them to make a mistake? How much can we really rely on the dictionary?
毕竟,词典编纂者也是普通人。当然,他们可能比普通人更关注用法的细微差别。但在决定哪些词应该收录进词典时,词典编纂者通常会做和我们其他人一样的事情。他们会倾听人们的诉求,阅读大量书籍,并尽力观察趋势:人们正在使用哪些新词?哪些词已经不再使用?哪些词条出现在了竞争对手的词典中?
After all, lexicographers are ordinary people. Sure, they may be more interested in nuances of usage than the average person on the street. But when trying to figure out what words to include in their dictionaries, lexicographers typically do the same kinds of things the rest of us do. They listen to what people are saying. They read a lot. They try their best to notice trends: What new words are people using? What words have they stopped using? What entries are popping up in competing dictionaries?
一旦形成了个人印象并确定了候选词,词典编纂者就会试图确定这些印象是否真实。我们认识一位词典编纂者,在判断某个词是否为真实词汇时,会使用以下标准:他会尝试在不相关的文章中找到该词的四个例子。词典编纂团队达成共识固然可取,但专业术语——例如是否收录“石墨烯”之类的词汇——可能由一位负责物理学的顾问来判断。编纂词典并非一门科学,而是一门有着数百年历史的艺术。
Once they form those personal impressions and identify a candidate word, lexicographers try to figure out if those impressions are real. One lexicographer we know, when trying to decide if something is a real word, uses the following criterion: He tries to find four examples of that word in unrelated pieces of writing. Consensus among the lexicographic team is desirable, but technical jargon—the decision whether to include a word like graphene—may be left up to the judgment of the one consultant who handles physics. Writing dictionaries is not a science. It’s a centuries-old art.
以《美国传统词典》为例。它的第四版出版于2000年,比第三版晚了八年。在这八年里,新词不断涌现。AHD的编辑们竭尽全力去寻找这些词汇。他们的成果包括amplidyne(一种发电机)、mesclun(一种沙拉)、netiquette(互联网礼仪)和phytonutrient(赋予植物颜色、味道和气味的化学物质)。他们的成果如何?
Take the American Heritage Dictionary. Its fourth edition was published in 2000, eight years after its third edition. In those eight years, new words had entered the language. The editors at the AHD did their best to hunt those words down. Their trophies included amplidyne (a type of power generator), mesclun (a type of salad), netiquette (Internet etiquette), and phytonutrient (the chemicals that give plants their color, flavor, and smell). How well did they do?
这张图表清楚地表明,AHD的记录充其量是好坏参半。在某些情况下,例如mesclun和netiquette,它们只是姗姗来迟。仅从频率来看,这两个词应该在 1992 年就符合AHD 的收录标准。至于amplidyne,它的盛况早已结束;amplidyne 在 20 世纪初达到顶峰,如今已完全过时。尽管词典编纂者竭尽全力,但他们仍难以及时发现新词,而且可能落后数十年。
This graph makes it clear that the AHD’s record is mixed at best. In some cases, like mesclun and netiquette, they were merely late to the party. Purely on the basis of frequency, both words should have qualified for the AHD in 1992. In the case of amplidyne, the party was long over; amplidynes peaked in the early twentieth century and are completely obsolete today. Despite their best efforts, lexicographers are hard pressed to detect new words in time, and can be decades behind.
当我们看到这个图时,我们知道——至少在识别单词方面——能够通过一次点击读取数十亿个句子对于词典编纂者来说可能是天赐之物。
When we saw this plot, we knew that—at least when it came to identifying words—being able to read billions of sentences in one click could be a godsend to lexicographers.
我们决定创建自己的描述性词典,收录当代英语中使用的所有词汇。我们的想法很简单:如果一串字符在当代英语文本中出现的频率足够高,那么它就是一个词。多高的频率才算足够高?自然的截断方法是使用词典中最稀有词的频率,我们计算出,大约每十亿个文本单词中就会出现一个这样的词。因此,对于“什么是词?”这个问题,我们的答案是:
We decided to create our own descriptive lexicon containing all of the words used in contemporary English. Our idea was simple: If a string of characters is frequent enough in contemporary texts written in English, then it’s a word. How frequent is frequent enough? The natural cutoff is to use the frequency of the rarest words in dictionaries, which we calculated was roughly one instance in every billion words of text. So our answer to the question “What is a word?” is:
一个英语单词是一个 1-gram,平均而言,在每十亿个 1-gram 英语文本中至少出现一次。
An English word is a 1-gram that appears, on average, at least once in every billion 1-grams of English text.
这显然不是对一个词的完美定义。“英语文本”是否包含可能嵌入在英语段落中的西班牙语引文?文本必须是最近的吗?它应该来自书籍、转录的语音还是互联网?我们真的应该把常见的拼写错误(例如 excesss)算作单词吗?那么部分数字形式(例如 l8r)呢?像 straw man 这样的 2-gram 难道就不能算作一个词吗?
This obviously isn’t a perfect definition of a word. Does “English text” include a Spanish quotation that might be embedded in an otherwise English passage? Does the text have to be recent? Should it come from books? Transcribed speech? The Internet? Should we really count common misspellings, like excesss, as words? What about partly numerical forms, like l8r? And can’t a 2-gram, like straw man, actually be a word?
尽管这个定义有诸多缺陷,但它实际上相当精确。它足够精确,只需有了这个定义、一份篇幅足够长的、大家一致认可的参考文本,以及一些计算机,就能创建一个客观的英语词典。从这一点来看,我们的定义比大多数参考文献中高度主观的表述要好得多。
Yet for all its faults, this definition is actually pretty precise. It’s precise enough that, furnished only with this definition, an agreed-upon reference text of sufficient length, and a bunch of computers, one can create an objective lexicon of the English language. In this one respect, our definition is better than the highly subjective formulations found in most references.
我们希望确保新的 Zipfian 词典能够代表当代用法,所以我们并没有将所有书籍都放进去。相反,我们选取了十年的数据片段,即数据库中所有出版于 1990 年至 2000 年的书籍。这个集合包含了超过 500 亿个 1-gram。一个 1-gram 要达到十亿分之一的截止频率,它必须在集合中至少出现 50 次。最终的列表包含 1,489,337 个单词,例如 unhealthiness、6.24、psychopathy 和 Augustean。
We wanted to make sure that our new, Zipfian lexicon represented contemporary usage, so we didn’t just throw all our books at it. Instead, we took a ten-year slice of our data—all the books in our database that were published between 1990 and 2000. This collection contained more than fifty billion 1-grams. For a 1-gram to meet our cut-off frequency of one in a billion, it had to appear at least fifty times in the collection. The resulting list contained 1,489,337 words, like unhealthiness, 6.24, psychopathy, and Augustean.
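The cutoff arithmetic, and the definition of a word it implements, can be written out as a short sketch. The function names are hypothetical, and the toy counts use a much looser cutoff (one in twenty) so that the example fits on a page. Integer arithmetic is used to avoid floating-point rounding at the threshold.

```python
ONE_PER = 1_000_000_000  # cutoff from the text: one occurrence per billion 1-grams

def min_occurrences(corpus_size, one_per=ONE_PER):
    """Smallest count that reaches a frequency of 1 in `one_per` (ceiling division)."""
    return -(-corpus_size // one_per)

def zipfian_lexicon(counts, one_per=ONE_PER):
    """Return the set of 1-grams whose frequency in `counts` meets the cutoff."""
    total = sum(counts.values())
    threshold = min_occurrences(total, one_per)
    return {gram for gram, count in counts.items() if count >= threshold}

# Fifty billion 1-grams at one in a billion gives the threshold quoted above.
print(min_occurrences(50_000_000_000))  # 50

# Toy counts, invented for illustration, with a cutoff of one in twenty.
toy_counts = {"the": 90, "unhealthiness": 8, "qwzx": 2}
toy_lexicon = zipfian_lexicon(toy_counts, one_per=20)
```

With the toy counts, the total is 100 tokens, so the threshold is 5 occurrences and only `the` and `unhealthiness` make the list.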
我们的 Zipfian 词典是一个非常方便的参考。如果一个词没有出现在这个列表中,那么它的频率就不如词典中出现频率最低的词——而且,我们有理由认为它不是一个词。如果它出现了,那么这个词很可能已经足够频繁,足以被收录进词典。如果它没有被收录进词典,人们就不得不思考为什么。
Our Zipfian lexicon is a pretty handy reference. If a word doesn’t appear in this list, then it’s not as frequent as the least frequent words in the dictionary—and it’s pretty reasonable to argue that it isn’t a word. If it appears, the word is probably frequent enough to warrant inclusion in the dictionary. If it is not included in dictionaries, one has to wonder why.
这就是拥有客观词典的乐趣之一。这么多年来,无论是在学校还是在玩拼字游戏,词典都被用来测试你。现在,有了独立评估词典的方法,情况就完全不同了,你可以反过来测试词典及其编纂者的准确性。几个世纪以来,一直都有业余的词典编纂者,但只有运用 ngrams 才能成为业余的“词典编纂者学家”。(词典编纂者学:研究无害苦工的学问。词典编纂者学家:更加无害的苦工。)
This is one of the fun things about having an objective lexicon. All these years, whether at school or at Scrabble, the dictionary has been used to test you. With an independent way of assessing the lexicon, the shoe is on the other foot, and it’s possible for you to test the accuracy of the dictionary and the lexicographers who wrote it. There have been armchair lexicographers for centuries, but only with ngrams can one become an armchair lexicographerologist. (Lexicographerology: the study of harmless drudges. Lexicographerologist: an even more harmless drudge.)
接下来,我们提出了词典编纂者学中最基本的问题:词典收录了我们 Zipfian 词典中的多少内容?
Next, we asked the most fundamental question in all of lexicographerology: How much of our Zipfian lexicon did the dictionary catch?
少得惊人。最全面的英语词典《牛津英语词典》收录的词汇不到五十万。它的词汇量大约是我们词典的三分之一。其他词典的词汇量都比它小。
Surprisingly little. The Oxford English Dictionary, the most comprehensive of English-language dictionaries, contains less than five hundred thousand words. Its lexicon is roughly a third the size of our list. All other dictionaries are smaller.
怎么会这样?词典编纂者真的对自己语言中发生的事情一无所知吗?
How could this be? Was it really true that lexicographers were so unaware of what was going on in their own language?
我们有点操之过急。大多数词典并不声称涵盖该语言的所有词汇。事实上,许多词典原则上会谨慎地排除几类术语,无论它们有多常见:
We have been a bit hasty. Most dictionaries don’t claim to capture all words in the language. In fact, many dictionaries, on principle, are careful to exclude several types of terms, regardless of how common they might be:
因此,当我们把词典里根本不想收录的内容放进词库时,却说“抓到你了!”,这很不公平。为了了解词典里哪些词是无意遗漏的,我们估算了词库中来自上述四个类别的词占比。
As such, it’s unfair of us to go “gotcha!” when we include things in our list that the dictionary isn’t even trying to include. To get a sense of what the dictionary leaves out that it didn’t deliberately mean to leave out, we estimated what fraction of our word list came from the above four categories.
这样,我们的词库就从不到150万个单词缩减到100多万个。尽管如此,我们的 Zipfian 词典的词条数量仍然是《牛津英语词典》的两倍多。即使是最全面的英语词典也会遗漏大多数单词。这些未收录的词汇包含许多丰富多彩的概念,例如 aridification(一个地理区域变得干旱的过程)、slenthem(一种乐器),以及恰如其分的 deletable 这个词。
That cut our list down from just under 1.5 million to a little over a million words. Still, our Zipfian lexicon had more than twice as many entries as the Oxford English Dictionary. Even the most comprehensive dictionary of the English language misses most words. These undocumented words included many colorful concepts, like aridification (the process by which a geographic region becomes dry), slenthem (a musical instrument), and, appropriately enough, the word deletable.
那么,什么原因导致字典出错呢?
So, what trips up dictionaries?
频率。事实证明,词典对高频词的覆盖率非常高。词典堪称完美——它们几乎涵盖了所有词汇的 100%——只要这些词汇的出现频率高于百万分之一,例如dynamite这个词。如果一个词在平均十本书中至少出现一次,词典就会像钟表一样精准地记录并定义它。
Frequency. It turns out that dictionaries have excellent coverage of frequent words. Dictionaries are completely perfect—they literally contain 100 percent of all words—as long as those words are more frequent than one in a million, such as the word dynamite. If a word appears at least once in the average pile of ten books, the dictionary will record it and define it, like clockwork.
但词典编纂者却难以处理罕见词汇。当一个词的频率低于百万分之一时,词典将其省略的概率就会急剧上升。当频率略高于十亿分之一时,词典只能收录四分之一的词汇。
But lexicographers struggle with the rare stuff. As a word’s frequency drops below one in a million, the chances that a dictionary has omitted the word will skyrocket. At frequencies just north of one in a billion, the dictionary only notices a quarter of all words.
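One way to measure this kind of coverage is to group words by the order of magnitude of their frequency and ask what fraction of each group a dictionary contains. The sketch below assumes that approach; the function name, the placeholder words, and their frequencies are all invented for illustration and are not the data behind the figures quoted above.

```python
import math
from collections import defaultdict

def coverage_by_band(freqs, dictionary):
    """Fraction of words present in `dictionary`, per power-of-ten frequency band.

    `freqs` maps word -> relative frequency (occurrences per 1-gram of text).
    Band -6 holds frequencies in [1e-6, 1e-5), band -9 holds [1e-9, 1e-8).
    """
    seen = defaultdict(int)
    hits = defaultdict(int)
    for word, freq in freqs.items():
        band = math.floor(math.log10(freq))
        seen[band] += 1
        if word in dictionary:
            hits[band] += 1
    return {band: hits[band] / seen[band] for band in seen}

# Placeholder words and made-up frequencies.
freqs = {
    "common": 3e-6,
    "rare_a": 2e-9,
    "rare_b": 3e-9,
    "rare_c": 5e-9,
    "rare_d": 6e-9,
}
dictionary = {"common", "rare_a"}
coverage = coverage_by_band(freqs, dictionary)
```

In this toy setup the band above one in a million is fully covered, while only one of the four words just above one in a billion is found, mirroring the shape of the result described in the text.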
如果你要记住齐普夫的一句话,那就是大多数词其实都是非常罕见的。所以,如果词典漏掉了大多数罕见词,那么它就漏掉了大多数词,就是这样。
And if there is one thing you should remember from Zipf, it is that most words are really rare. So, if dictionaries miss most rare words, then they miss most words, full stop.
结果发现,英语中 52% 的词汇(书籍中使用的大部分词汇)都是词汇暗物质。就像暗物质构成了宇宙的大部分,词汇暗物质构成了我们语言的大部分,但却未被标准参考文献所发现。
As a result, it turns out that 52 percent of the English language—most of the words used in books—is lexical dark matter. Like the dark matter that makes up the majority of the universe, lexical dark matter makes up the majority of our language, but goes undetected by standard references.
随着传统词典编纂的局限性日益凸显,该领域开始发生变化。像wordnik.com、wiktionary.com和urbandictionary.com这样的新进入者,开始依赖纸上谈兵的词典编纂者来构建全面的在线词典。实际上,他们正试图利用众包来记录所有“暗物质”。像《牛津英语辞典》这样的传统词典也希望深入研究大数据。为了使其汇编与时俱进,他们正在用一种新兴的数据驱动词典编纂方式来补充现有方法。(甚至还融入了词典学的元素!)
As the limitations of traditional lexicography have become increasingly apparent, the field has begun to change. New entrants, like wordnik.com, wiktionary.com, and urbandictionary.com, have come to rely on armchair lexicographers in their efforts to build comprehensive online dictionaries. In effect, they are attempting to use crowdsourcing to document all of the dark matter. Traditional dictionaries like the OED are hoping to dive into big data, too. To bring their compendia up to speed, they are supplementing existing methods with an emerging style of data-driven lexicography. (And even with a touch of lexicographerology!)
总的来说,这些进展对词典编纂者来说无疑是个好消息。尽管经过了几个世纪的努力,但大部分工作仍有待完成。总的来说,英语是一片未知的大陆。
Overall, these developments are certainly good news for lexicographers. Despite centuries of effort, most of the work remains to be done. English is, by and large, an uncharted continent.
新词总是让人兴奋不已。美国方言协会每年都会举办一次会议来表彰这些新词。会员们会投票选出以下类别的新词:年度词汇、最离谱词汇,甚至最不可能成功,这是我们自己创造的词汇——文化组学——在2010年获得的殊荣。自1991年以来,年度词汇包括cyber(1994年)、e-(1998年)、metrosexual(2003年),以及最近的hashtag(法语单词mot-dièse,以防法国政府阅读)。美国方言协会编制的榜单证明了语言本身就不断欢迎和庆祝新词。
New words always get people excited. Every year, the American Dialect Society holds a meeting to honor all these new words. Members vote on categories like Word of the Year, Most Outrageous, and even Least Likely to Succeed, a distinction that our own coinage—culturomics—earned in 2010. Since 1991, Words of the Year have included cyber (1994), e- (1998), metrosexual (2003), and most recently hashtag (mot-dièse, in case the French government is reading). The lists compiled by the American Dialect Society testify to a language that is constantly welcoming and celebrating new words.
但在词汇生命周期的另一端,却鲜有动静。似乎没人有兴趣为已经消亡的词汇举行葬礼。因此,很难判断英语的诞生率是否超过了死亡率——英语究竟是在增长、萎缩,还是保持稳定。
But there’s very little activity at the other end of the lexical life cycle. Nobody seems interested in holding funerals for words that have died. So it’s hard to tell whether the birth rate exceeds the death rate—whether English is growing, shrinking, or remaining stable.
为了找到答案,我们又创建了两个 Zipfian 词典。第一次,我们使用了 1990 年至 2000 年之间出版的文本来创建当代词典。这次,我们使用了两个历史时期:1900 年之前的十年和 1950 年之前的十年。
To find out, we created two more Zipfian lexicons. The first time around, we had used texts published between 1990 and 2000 to create a contemporary lexicon. This time, we used two historical periods: the decade preceding 1900, and the decade preceding 1950.
我们发现,到1900年,该词典已收录超过55万个词条。这比今天的《牛津英语词典》收录的词条还要多。在接下来的五十年里,似乎没有发生什么大的变化,英语的规模保持稳定。新词的诞生与旧词的消亡大体相抵。
We found that by 1900, the lexicon already contained more than 550,000 entries. That’s more words than are in today’s Oxford English Dictionary. For the next fifty years, not much seemed to happen, and the language remained stable in size. Births and funerals managed to hold serve.
但在1950年至2000年间,英语进入了增长期,随着数十万新词的加入,其规模几乎翻了一番。新词诞生的数量远远超过了词汇的“临终仪式”。目前,每年约有8400个词汇进入英语,也就是说,仅今天就有20多个新词跨过了门槛。
But between 1950 and 2000, English entered a period of growth, nearly doubling in size as hundreds of thousands of new words were added. New births dramatically outnumbered lexical last rites. Currently, about 8,400 words enter the English language each year—more than 20 new words crossed the threshold today.
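The per-day figure follows directly from the yearly one. As a quick check, assuming a 365-day year (the variable names are ours):

```python
WORDS_PER_YEAR = 8_400  # yearly figure quoted in the text
DAYS_PER_YEAR = 365     # assumption: ignore leap years

# Average number of words crossing the threshold per day.
words_per_day = WORDS_PER_YEAR / DAYS_PER_YEAR
print(round(words_per_day, 1))  # 23.0
```

About 23 words a day is consistent with "more than 20 new words crossed the threshold today."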
Our language is not only changing—it’s growing.
这是为什么呢?没有人真正知道答案,不过,就像幂律的成因一样,人们对此也有很多猜测。一种假设是,随着社会联系日益紧密(我们与更多人保持联系),世界变得越来越小(人们最多只需一个电话或一趟飞机就能联系到),新词汇更容易达到临界质量。另一种假设认为,随着科学、医学和技术的进步,随着专业术语进入日常用语,新词汇也随之涌现。还有一种可能性在于书籍记录本身的多样化,而书籍记录正是我们齐普夫词典的基础。随着20世纪末社会各阶层开始出版书籍,作者们用更广泛的方言撰写了更多主题,为全球讨论引入了更多词汇。
Why is that? Nobody really knows, although, as with the cause of power laws, conjectures abound. One hypothesis is that as our society becomes increasingly connected (we keep in touch with more people) and our world gets smaller (people are at most a phone call or a plane ride away), new words reach critical mass more easily. Another hypothesis suggests that progress in science, medicine, and technology introduces new words as jargon enters the general parlance. Yet another possibility lies in diversification within the book record itself, the basis of our Zipfian lexicon. As a broader cross section of society began to publish books in the late twentieth century, authors wrote about more topics in a wider range of dialects, introducing more words to the global discussion.
说实话,没人能确切知道答案。由于我们不知道这种影响从何而来,也很难猜测它最终会走向何方。每年诞生的词汇数量会增加吗?词汇量的上限是多少?你孩子的语言和你的会有多大区别?随着大数据的视野逐渐照亮我们的语言,它们照亮了通往全新科学领域的道路,在那里,就连大脚怪也无处可藏。
Truth be told, no one knows for sure. And since we don’t know where this effect is coming from, it is hard to guess where it is all going. Will the number of words born each year increase? What is the limit on the size of the lexicon? How different will your kid’s language be from yours? As the scopes of big data illuminate our language, they light the way to a new scientific landscape, one where even the Sasquatch has nowhere to hide.
但我们所使用的词语所讲述的故事远比语言本身更宏大。它们是我们了解思想、习俗和社会的窗口。因此,让我们将视野从沟通机制转向思考的本质。
But the words we use tell a story much greater than that of our language. They are a window into our thoughts, our mores, and our society itself. So let’s turn our scope away from the mechanism of our communication, and toward the substance of our thought.
在二十世纪中期,事实证明,请一位 sitter(看护人)来照顾 baby(婴儿)是个非常好的主意。由于 baby 和 sitter 这两个词的兴趣如此契合,它们开始花很多时间在一起,baby sitter 也变得越来越常见。
In the mid–twentieth century, it turned out that taking care of a baby using a sitter was a very good idea. Since the words baby and sitter had such compatible interests, they started spending a lot of time together, and baby sitter became increasingly frequent.
很快,人们开始认为它们形影不离,并用连字符来表示这种结合。随着关系越来越认真,baby-sitter 变得越来越常见,baby sitter 则开始被取代。
Soon, people started to view them as joined at the hip. They represented this join with a hyphen. As the relationship got more serious, baby-sitter became increasingly frequent, and baby sitter started to get replaced.
最终,baby 和 sitter 意识到它们是天作之合。一个孩子由此诞生。亲爱的孩子,这就是你父母把你留给我这个 babysitter 的原因。
Eventually, baby and sitter realized they were a match made in heaven. A child was born of this union. And that, dear child, is why your parents left you here with me, the babysitter.
7.5分钟的成名
7.5 MINUTES OF FAME
Getting rid of crap is not sexy. But it can be heroic.
问问希腊神话中的英雄赫拉克勒斯就知道了。赫拉克勒斯的十二功绩中的第五功是清理奥吉亚斯的牛圈,那里关押着数千头不朽的牛。由于牛圈三十年未曾清扫,积攒了大量粪便。赫拉克勒斯在一天之内就改变了两条湍急河流的流向,净化了牛圈。他的英勇事迹至今仍是粪便工程史上最伟大的成就之一。
Just ask Hercules, the hero-god of Greek mythology. For his fifth of twelve labors, Hercules was tasked with cleaning the Augean stables, which housed thousands of immortal cows. Because the stables had not been cleaned in thirty years, they had come to contain a sizable cache of waste. Hercules redirected two raging rivers to purge the stables in a single day. His heroic deed remains one of the greatest achievements in the annals of scatological engineering.
几千年后,类似的传奇必将在人们口中流传,讲述我们自己的计算界赫拉克勒斯 Yuan Shen。谷歌花了五年时间,在世界知识的丰饶牧场上放牧,其迅捷的扫描过程吞噬了数以百万计的书籍。然而,作为打造全球最大数字永生图书库的必然副产品,谷歌也积累了大量垃圾数据。大数据本身就很乱。是时候清理这个牛圈了。
Millennia from now, similar legends will surely be told about Yuan Shen, our own computational Hercules. Google had spent five years grazing at the rich pastures of world knowledge, its swift scanning process ingesting books by the million. Yet as an inevitable by-product of having created the world’s largest stable of digitally immortalized books, the company had accrued a significant quantity of poop-grade data as well. Big data is messy. The time had come to clean the stable.
最近您花了多少时间阅读图书馆卡片目录?
How much quality time have you spent with a library card catalog lately?
卡片目录曾经是图书馆借阅的核心。图书馆里的每本书都有一张卡片,上面记录着书名、作者、主题、出版年份以及至关重要的索书号等重要信息,索书号指示着这本书的存放位置。图书馆的访客整天都涌向卡片目录,而目录中的信息又会把他们吸引到书架最远的角落。
Card catalogs used to be the heart of library circulation. There was one card for every book in the library, containing vital facts like the title, the author, the subject, the year of publication, and the all-important call number, which indicated where the book was located. Library visitors would stream into the card catalog all day long, and the information in the catalog would, in turn, pump them into the farthest corners of the stacks.
没有卡片目录,图书馆就会变成一张楼房大小的杂乱书桌:你什么也找不到。几个世纪以来,最重要的图书馆之一梵蒂冈秘密档案馆(Archivio Segreto Vaticano)就是这样。它缺乏一个全面的卡片目录来收录占据其五十二英里书架空间的藏书。里面有些什么?即使是那些可以不受限制地访问的人,也只能用事实、谣言和传说来回答。找到一本书的关键在于认识某个人,而这个人又认识另一个人,而这个人又知道(或自认为知道)这本书在哪里。档案馆收藏着可追溯到公元八世纪的珍贵手稿(比如伽利略异端审判的记录),但寻找这些宝藏可能是一场堪比印第安纳·琼斯的冒险。这无疑是保守秘密的一种方式。
Without its card catalog, a library becomes a cluttered desk the size of a building: You can’t find anything. For many centuries, one of the most important libraries, the Archivio Segreto Vaticano (the Vatican Secret Archive), was just this way. It lacked a comprehensive card catalog for the works that occupy its fifty-two miles of shelf space. What was in there? Even those who had unfettered access could answer only with a mixture of fact, rumor, and legend. Finding a book was a matter of knowing someone who knew someone who knew (or thought they knew) where the book was. The archive contains priceless manuscripts dating all the way back to the eighth century—like records from Galileo’s heresy trial—but finding these treasures could be an adventure worthy of Indiana Jones. That’s certainly one way of keeping a secret.
对于我们和其他图书馆用户一样,仅仅访问书籍是远远不够的。如果我们想比较不同时代和地点的文本,就需要准确的卡片目录元数据来告诉我们每本书的内容,这样我们才能知道如何在自动化分析的背景下对其进行分类。
For us, like any other library users, access to the books alone was not nearly enough. If we wanted to compare texts from different times and places, we needed accurate card catalog metadata telling us what each book was, so we would know how to classify it in the context of an automated analysis.
一开始,我们以为这不会是什么大问题:谷歌利用数百个来源的目录信息,整理出了1.3亿本书的购物清单。(如今,各大图书馆的卡片目录都已实现计算机化——这是数字化带来的首批好处之一——而实体卡片本身通常被放在了旁边的房间里。)但事实证明,即使是最好的卡片目录,也充斥着错误。
Going in, we didn’t think that was going to be a big problem: Google had assembled its shopping list of 130 million books using catalog information from hundreds of sources. (These days, the card catalogs of the major libraries have been computerized—one of the first things to benefit from digitization—and the physical cards themselves are often relegated to a side room.) But it turns out that card catalogs, even the best ones, are riddled with errors.
这些错误一旦出现,很难很快得到纠正。卡片数量如此之多,即使是最热心的图书馆用户也未必总能注意到错误。错误要么导致用户无法找到卡片(在这种情况下,正所谓“非礼勿视,非礼勿听,非礼勿言”),要么出在诸如出版地之类的地方。只要索书号准确无误,用户照样能找到这本书。卡片上有问题的元数据不会给读者带来太多困扰,因为正确的信息已经在书的扉页上等着了。
Once made, these errors don’t get corrected very fast. There are so many cards, and even the most enthusiastic library users don’t always notice the mistake. Either the error prevents a user from finding the card (in which case, “See no evil, hear no evil, speak no evil”), or the error lies in something like the place of publication. As long as the call number is still accurate, the user finds the book anyway. The problematic metadata on the card doesn’t bother the reader much, because the correct information is already waiting on the book’s title page.
随着时间的推移,大量未更正的错误从实体卡片目录流传到数字卡片目录,再到谷歌的“目录之母”,最终流传到我们手中。与那些只想读某一本书的人不同,我们尤其容易受到这些错误的影响:我们无力逐一人工检查数百万本书。然而,很大一部分卡片都包含错误。当我们使用这些目录元数据生成 ngram 表时,结果往往杂乱无章,根本无法使用。根据我们最初的计算,隔壁办公室的朋友在16世纪人气飙升。当我们就此事问她时,她否认自己有那么老。要么她在骗我们,要么我们遇到了一个非常棘手的问题。
Over time, those legions of uncorrected errors made their way from physical card catalogs to digital card catalogs, then to Google’s mother of all catalogs, and then to us. Unlike people who are interested in reading a single book, we were particularly vulnerable to errors: We couldn’t afford to manually look through each of the millions of books. Yet a large fraction of the cards contained mistakes. When we used this catalog metadata to produce ngram tables, the results were often so badly scrambled as to be unusable. According to our initial calculations, our friend in the office next door had enjoyed a surge in popularity during the sixteenth century. When we confronted her about this, she denied being that old. Either she was lying to us, or we had a very big problem on our hands.
该怎么办?
What to do?
由于无法手工翻阅书籍,我们决定编写计算机算法来查找可疑卡片,即任何暗示卡片上信息可能有误的迹象。例如:杂志。图书馆通常会把连续出版物(无论是报纸、学术期刊还是其他期刊)的每一期都标注为第一期的出版日期。这意味着,根据我们的卡片目录,《时代》杂志的每一期都出版于1923年。对于我们而言,这是一个非常大的问题。
Since we couldn’t go through the books by hand, we decided to write computer algorithms to look for suspicious-looking cards—for anything that suggested that the information on a card might be erroneous. For instance: magazines. Libraries typically assign every single issue of a serial publication—be it a newspaper, an academic journal, or any other periodical—the publication date of the very first issue. That means every issue of Time magazine was, according to our card catalog, published in 1923. For our purposes, this was a very big problem.
为了解决这些问题,我们编写了一个名为“连环杀手”的算法,用于查找任何看起来像是连载出版物的内容。另一个名为“快速约会者”的算法,会查看一本书,并根据其内容猜测其出版时间。这些方法结合起来,帮助我们识别可疑卡片及其所属的书籍。然后,我们可以将这些书籍排除在分析范围之外。
To resolve these issues, we wrote an algorithm called the Serial Killer to find anything that looked like it might be a serial publication. Another algorithm, called the Speed Dater, looked at a book and tried to guess when it was published based on the text it contained. Together, these approaches helped us identify suspicious cards and the books that they belonged to. We could then exclude these books from our analyses.
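The text does not spell out how the Serial Killer or the Speed Dater worked internally, so the following is only a minimal sketch of the general idea: flag suspicious catalog records, then exclude them. Everything in it is an assumption made for illustration, including the record layout, the function names `looks_like_serial` and `date_conflicts`, and both heuristics (many records sharing one title and year; text that mentions a year later than the catalog date).

```python
from collections import Counter
import re

# Hypothetical catalog records: (record_id, title, catalog_year, text).
RECORDS = [
    ("a1", "Time", 1923, "Vol. 1, March 3, 1923. The week in review."),
    ("a2", "Time", 1923, "The Apollo landing of 1969 dominated the news."),
    ("a3", "Time", 1923, "Looking back on 1985, a year of summits."),
    ("b1", "Oliver Twist", 1838, "Among other public buildings in a certain town"),
    ("c1", "A History of Radio", 1580, "Broadcasting began in earnest around 1920."),
]

# Four-digit years from 1500 to 2099 appearing in the text.
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d\d|20\d\d)\b")


def looks_like_serial(records, min_repeats=3):
    """Flag records whose (title, year) pair repeats suspiciously often."""
    repeats = Counter((title, year) for _, title, year, _ in records)
    return {rid for rid, title, year, _ in records
            if repeats[(title, year)] >= min_repeats}


def date_conflicts(records):
    """Flag records whose text mentions a year later than the catalog year."""
    flagged = set()
    for rid, _, year, text in records:
        mentioned = [int(y) for y in YEAR_PATTERN.findall(text)]
        if mentioned and max(mentioned) > year:
            flagged.add(rid)
    return flagged


# A record is dropped if either heuristic flags it.
suspicious = looks_like_serial(RECORDS) | date_conflicts(RECORDS)
clean = [rid for rid, *_ in RECORDS if rid not in suspicious]

print(sorted(suspicious))  # ['a1', 'a2', 'a3', 'c1']
print(clean)               # ['b1']
```

The design choice mirrors the passage: the filters do not try to repair a bad card, they only decide which books to leave out, trading dataset size for cleanliness.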
最终,在2009年夏天,袁将这些方法与他的软件工程能力结合起来,洗刷掉了污染我们大数据的垃圾。数百万本书籍被冲入一条海量计算的河流,其规模之大甚至触发了谷歌的内部预警系统。这场史无前例的清洗之后,剩下的书籍数量只是我们最初拥有的一小部分。尽管如此,它的规模和历史深度仍然是前所未有的:五千亿个词,跨越五个世纪,以七种不同的语言书写。它包含了有史以来出版的所有书籍的4%以上。
Finally, in the summer of 2009, Yuan combined these methods with his software engineering muscles in order to wash away the crap that was befouling our big data. Millions of books were flushed in a river of computation so massive that it set off Google’s internal warning systems. What was left after this laundering of legendary proportion was only a fraction of what we had started with. Nevertheless, it was still unprecedented in size and historical depth: five hundred billion words, written over five centuries, in seven different languages. It contained more than 4 percent of all books ever published.
同样重要的是,海量数据集也熠熠生辉。尽管文本总量比人类基因组长一千倍,但它的精确度——逐个字母——却比人类基因组计划报告的序列高出十倍。
Just as important, the massive dataset gleamed. Despite the fact that the total amount of text was a thousand times longer than the human genome, it was—letter for letter—ten times as accurate as the sequence reported by the Human Genome Project.
现在,输入文本和卡片目录元数据都已完美无缺,它们生成的 ngram 数据看起来非常棒。我们可以清晰地辨别出大量的语言和文化变迁,例如从throve到thrived 的转变,以及从电报到电话再到电视的演变。从科学角度来说,我们一眼就看到了 ngram 数据,简直是一见倾心。
And now that the input texts and the card catalog metadata were pristine, the ngram data they produced looked great. We could clearly discern a vast array of linguistic and cultural changes, like the shift from throve to thrived, and the progression from telegraph to telephone to television. As soon as we caught a glimpse of the ngram data, it was, scientifically speaking, love at first sight.
但就像许多夏日恋情一样,我们与 ngram 的爱情到了秋天也会遭遇阻碍。随着袁的实习在学年开始时结束,我们很快就会发现自己又回到了谷歌之外,而数据却被留在了公司的防火墙之内。
But like so many summer romances, our love affair with ngrams would face obstacles come fall. With Yuan’s internship wrapping up at the start of the academic year, we would soon find ourselves back outside Google, leaving our data behind the company’s firewall.
我们需要谷歌将数据发送给我们。但这家互联网巨头不愿这样做。根据谷歌的说法,ngram 数据仍然极其敏感。ngram 数据集是根据五百万本书的全文计算得出的,而谷歌的法律考量很简单。五百万本书对应五百万作者,如果数据泄露,就可能引发一场大规模诉讼,而这五百万作者就对应着五百万原告。我们专门设计了 ngram 影子数据集来解决这个问题,通过计算单词而不是记录长文本。但我们这种组合技巧尚未在法庭上得到检验。谷歌的谨慎态度是可以理解的。
We needed Google to send us the data. But the Internet giant didn’t want to. By Google’s account, ngram data was still extraordinarily sensitive. The ngram dataset had been calculated from the full text of five million books, and Google’s legal calculus was simple. Five million books corresponds to five million authors, which corresponds to five million plaintiffs in the massive lawsuit that might result if the data were to leak. We had specifically designed the ngram shadow dataset to get around this problem by counting words instead of recording long stretches of text. But our combinatorial sleight of hand had not yet been tested in a court of law. Google was understandably wary.
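To make "counting words instead of recording long stretches of text" concrete, here is a minimal sketch of an ngram counter. The function name `ngram_counts` and the whitespace tokenization are assumptions for illustration; the actual pipeline is not described in this passage.

```python
from collections import Counter


def ngram_counts(text, n):
    """Count every run of n consecutive words in text."""
    words = text.lower().split()
    return Counter(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))


sample = "it was the best of times it was the worst of times"
two_grams = ngram_counts(sample, 2)

print(two_grams["it was"])   # 2
print(two_grams["best of"])  # 1
```

The resulting table records how often each short phrase occurs but not where, so the order of phrases in the original sentence is no longer stored in it.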
面对全球最大公司之一的法律部门,我们手里的牌其实很少。但面对着20亿个ngram,我们还没准备好放弃。
We had very few cards to play when faced with the legal department of one of the world’s largest corporations. But with two billion ngrams in the pot, we were not yet ready to fold.
我们一张张牌都用完了。机遇,比如阿维娃·艾登获奖,这最初为我们打开了谷歌总部的大门。陌生人的善意,比如彼得·诺维格的绿灯和他愿意合作。我们甚至“打电话给朋友”,因为一位失散多年的邻居本·拜尔竟然是谷歌研究院的“时空大师”(这可能是谷歌历史上最伟大的职位)。但还有一张牌我们还没打。
We had exhausted one card after another. Chance, in the form of Aviva Aiden receiving an award, which initially opened the doors of the Googleplex to us. The kindness of strangers, in the form of Peter Norvig’s green light and his willingness to collaborate. We had even “phoned a friend,” when a long-lost neighbor, Ben Bayer, turned out to be the “Master of Space and Time” at Google Research (possibly the greatest job title in corporate history). But there was one card we had yet to play.
我们所有关于量化历史趋势的讨论都引起了史蒂芬·平克的注意,他是当今最杰出的科学家之一,也是我们一直钦佩的人。
All our talk about quantifying historical trends had caught the attention of Steven Pinker, one of the most prominent scientists alive today and someone whom we had always admired.
平克是一位学识渊博、研究精深的心理学家、语言学家和认知科学家。他著有众多畅销书,拥有非凡的洞察力,能够将最复杂的问题提炼出其本质,清晰透彻。例如,有一次,平克做客讽刺新闻节目《科尔伯特报告》。斯蒂芬·科尔伯特问他:“大脑是如何运作的?五个词,甚至更少。” 平克想了几秒钟,回答道:“脑细胞的运作是有规律的。”
Pinker is a psychologist, linguist, and cognitive scientist of extraordinary breadth and depth. The author of numerous bestsellers, he has the uncanny ability to distill the most complex problems to their very essence in a crystal-clear way. For instance, on one occasion, Pinker appeared on the satirical news show The Colbert Report. Stephen Colbert asked him, “How does the brain work? Five words or less.” Pinker thought for a couple of seconds and said, “Brain cells fire in patterns.”
巧合的是,平克的粉丝之一正是丹·克兰西(Dan Clancy),他在2009年夏天担任整个谷歌图书业务的负责人。克兰西的地位很高,单凭他一句话,我们就能在园区之外访问 ngram 数据了。但克兰西是个忙碌而重要的人,没时间理会我们或我们的小项目。不过,随着夏天接近尾声,事情变得明朗起来:如果平克愿意出席一次会议来讨论 ngram,那么难以捉摸的丹·克兰西也会找到时间参加。
As luck would have it, one of Pinker’s fans is none other than Dan Clancy, who in the summer of 2009 was the head of the entire Google Books operation. Clancy was high enough on the totem pole that his word alone would be enough to get us off-campus access to the ngram data. But Clancy is a busy, important guy who had no time for the likes of us or our little project. Still, as the summer drew to a close, it became clear that if Pinker would be willing to show up for a meeting to discuss the ngrams, then the elusive Dan Clancy would find the time to make it, too.
于是我们问平克:“你看,我们生成了这20亿个ngram——你能帮我们把它们解放出来吗?” 平克觉得我们的工作很有潜力,就答应了。克兰西也同意了。我们只有30分钟的时间来陈述我们的想法。
So we asked Pinker: Look, we’ve generated these two billion ngrams—could you help us liberate them? Pinker thought our work had the potential to be useful and agreed to come. So Clancy agreed to come, too. We had thirty minutes to make our case.
几年前,平克被《时代》杂志评选为全球最具影响力的百人之一。会议开始后,原因显而易见。半个小时的时间足够他施展魔法了。很快,ngram 就来了。
Some years ago, Pinker had been named one of the hundred most influential people on the planet by Time magazine. As the meeting got under way, it was clear why. Half an hour was more than enough time for him to work his magic. Soon, the ngrams were on their way.
那么名气能给你带来什么呢?平克的名气为我们换来了克兰西三十分钟的时间。虽然不多,但足够了。
So what does fame buy you? Pinker’s fame bought us thirty minutes of Clancy’s time. Not much—but it was enough.
名声就像一只蜜蜂。
Fame is a bee.
它有一首歌——
It has a song—
它有一根刺——
It has a sting—
啊,它还有翅膀。
Ah, too, it has a wing.
艾米莉·狄金森的这首诗捕捉到了名望的本质:它的诱惑、它的危险、它如何提升一个人,以及它总是飘忽不定,遥不可及。想必狄金森对此深有体会。她或许是美国最著名的诗人。
This poem by Emily Dickinson captures the essence of fame: the allure, the danger, the way it elevates a person, and its tendency to float just beyond our reach. Dickinson, one imagines, should know. She is perhaps America’s most famous poet.
然而,狄金森与名望的关系并非一帆风顺。她对名望的理解源于直觉,而非经验。狄金森生前默默无闻,但她留下的诗歌在她1886年去世近半个世纪后,依然成为人们热议的话题。
Yet Dickinson’s relationship with fame is not straightforward. What she knew about fame she knew from intuition, not experience. A complete unknown during her lifetime, Dickinson left behind poetry that became the subject of widespread discussion nearly half a century after she died in 1886.
狄金森与名望的关系是例外还是常态?名望以各种不同的方式、在各种不同的时间、出于各种不同的原因降临到人们身上,似乎没有固定的路径。查尔斯王子和戴安娜王妃的儿子威廉王子从出生那一刻起就声名鹊起,甚至更早,因为他从娘胎里就注定了自己会成为英国国王。流行歌手贾斯汀·比伯十三岁时就在YouTube上被发掘;五年后,比伯成为地球上被谷歌搜索次数最多的人。有时,毕生的学习可以转化为一夜成名,就像平克那样,当时已经是麻省理工学院的教授,40岁时凭借其畅销书《语言本能》一举成名,享誉全球。另一方面,茱莉亚·查尔德直到40多岁才开始学习烹饪。但这仍然给她留下了足够的时间来彻底改变美国烹饪,并成为国家偶像。
Is Dickinson’s relationship with fame the exception or the rule? Fame finds people in so many different ways, at so many different times, and for so many different reasons that there seems to be no typical route. Prince William, son of Prince Charles and Princess Diana, was famous from the very moment of his birth, or even earlier, given that his destiny to become the king of England was preordained from the womb. Pop singer Justin Bieber was discovered on YouTube when he was only thirteen; five years later, Bieber was the most Googled person on Earth. Sometimes, a lifetime of learning translates into overnight fame, as when Pinker, already an MIT professor, soared to worldwide acclaim at age forty with the publication of his runaway bestseller The Language Instinct. On the other hand, Julia Child didn’t start learning to cook until she was past forty. But that still left her with enough time to revolutionize American cuisine and become a national icon.
就像艾米莉·狄金森一样,许多名人在生前从未体验过名望。文森特·梵高在世时几乎没有一幅画作售出;他去世时,他的天才未被认可。僧侣哥白尼深知,他的伟大思想——地球绕太阳转而非太阳绕地球转——如此具有煽动性,以至于他直到临终才将其发表。在某些行业,身后名声大噪乃常态。正如联邦将军威廉·特库姆塞·谢尔曼所说:“我想我明白了军中名声的含义:战死沙场,名字却被报纸误写。”
Like Emily Dickinson, many of the most famous people never experience fame in their own lifetimes. Almost none of Vincent van Gogh’s paintings sold during his lifetime; he died with his genius unrecognized. The monk Copernicus understood that his big idea—the notion that the Earth circled the sun and not the other way around—was so incendiary that he waited until he was on his deathbed to see it published. In some lines of work, posthumous fame is the norm. As Union general William Tecumseh Sherman put it, “I think I understand what military fame is; to be killed on the field of battle and have your name misspelled in the newspapers.”
还有一些人似乎毫无缘由地出名。像帕丽斯·希尔顿和金·卡戴珊这样名声显赫的名人,他们凭借名气积累的声誉,甚至可能成为一种自我实现的预言。这些人凸显了名声的非凡吸引力:吸引我们的不仅仅是名人的成就,更是他们名气本身。
And then there are the people who appear to be famous for no particular reason at all. Famously famous folks, like Paris Hilton and Kim Kardashian, develop a reputation for being famous that can become a sort of self-fulfilling prophecy. Such people highlight the extraordinary gravitational pull that fame exerts: It is not only the achievements of famous people that draw us to them, but the very fact that they are famous, in and of itself.
考虑到我们对名声的痴迷程度,令人惊讶的是,我们对名声的了解却很少。
Given how fascinated we all are with fame, it’s quite surprising how little we understand about how it works.
名声是什么?就像能量或生命一样,名声是一个我们凭直觉都能理解的日常概念,却又极难定义。(波特·斯图尔特法官曾就色情作品说过一句名言:“我一看到就知道了。”他说的其实也完全可以是名气。)同样明显的是,名气的量级也多种多样:每个人都知道耶稣比歌手约翰·列侬更有名,列侬比演员亚历克·鲍德温更有名,而鲍德温比吃热狗冠军小林尊更有名。但同样,“更有名”的确切定义很难找到。就像爱情和美一样,名望难以定义,衡量更是难上加难。然而,如果我们希望理解名望,学会如何衡量它将意义非凡。衡量虽然并非解决所有智力问题的良方,但却是揭开那些原本可能模棱两可、难以捉摸的概念神秘面纱的绝佳工具。
What is fame? Like energy or life, fame is an everyday concept that we all intuitively grasp but find extremely hard to define. (When Justice Potter Stewart famously said of pornography, “I know it when I see it,” he could just as well have been talking about fame.) It’s also clear that fame comes in a wide variety of sizes: Everybody knows that Jesus is more famous than singer John Lennon, that Lennon is more famous than actor Alec Baldwin, and that Baldwin is more famous than hot dog–eating champion Takeru Kobayashi. But again, a precise definition of what it means to be “more famous” is hard to come by. Like love and beauty, fame is hard to define, and harder still to measure. Yet if we hope to understand fame, learning how to measure it would be invaluable. Measurement, although not the solution for all intellectual problems, is a great tool for demystifying notions that might otherwise remain ambiguous and flighty.
就拿飞行的概念本身来说吧。1903年,得益于汽车的新近发展,航空工程风靡一时。当时还没有车库(在1906年之前,garage 这个词的 ngram 几乎不存在),但如果有的话,每一间车库里都会有一位发明家在争分夺秒地制造第一架飞机:一种比空气重、可以依靠自身动力起飞并进行可控飞行的装置。当时已有的机器都不符合要求。它们要么无法起飞,要么立即坠毁。大多数发明家认为问题出在发动机上。只要他们能制造出足够强大的发动机,就能实现飞行的梦想。
Take the concept of flight itself. In 1903, thanks to the recent development of automobiles, aeronautical engineering was all the rage. There were no garages back then (the ngram for garage is virtually nonexistent prior to 1906), but if there had been, every one of them would have been filled with an inventor scrambling to build the first airplane, a heavier-than-air device that could take off under its own power and engage in controlled flight. Existing machines didn’t fit the bill. Either they couldn’t get off the ground or they crashed immediately. Most inventors believed that the problem was the engine. If only they could make an engine powerful enough, they could achieve the dream of flight.
但来自中西部的两位自行车修理工奥维尔和威尔伯却不这么认为。莱特兄弟认为真正的问题在于机翼。他们认为,如果没有一个像样的机翼,再好的发动机也无济于事。当时,已经存在大量关于机翼性能的数学理论。但当莱特兄弟研究这些理论时,他们意识到这些理论与他们在失败的试飞中看到的情况不符。他们认为,对于机翼来说,理论只能起到有限的作用。理论对物理世界做出了基本假设,而这些假设可能是错误的。所以问题不在于理论,而在于测量。他们需要的是一种通过实验研究飞机机翼空气动力学的方法——制造测试机翼并快速测量其工作情况。
But Orville and Wilbur, two bicycle mechanics from the Midwest, didn’t see it that way. The Wright brothers thought that the real problem was waiting in the wings. If you didn’t have a decent wing, they reasoned, a better engine wouldn’t help. At the time, there were already extensive mathematical theories about how wings should perform. But when the Wrights studied the theory, they realized that it didn’t match up with what they were seeing in their failed test flights. When it came to wings, they decided, theorizing could only take you so far. The theory made underlying assumptions about the physical world, and those assumptions might be wrong. So the problem was not one of theory, but of measurement. What they needed was a way to study the aerodynamics of airplane wings experimentally—to create test wings and to rapidly measure how well they worked.
因此,在激烈的竞争中,莱特兄弟甘冒了一次经过深思熟虑的风险。他们没有继续进行更多的飞行测试,而是躲在俄亥俄州代顿市自行车店的后面。在那里,他们花了几个月的时间制造了一个精确的机翼性能测量工具。最终,他们制造了一个小型汽油发动机,它能在一个相邻的六英尺长的木制腔室(风洞)中产生恒定的气流。利用风洞,莱特兄弟可以快速测量一个又一个机翼设计,精确确定每个翼型产生的升力和阻力。当然,他们在风洞中对翼型性能的测量是一种简化,不能完美模拟实际飞行中实际飞机上实际机翼的性能。但他们认为,有数据总比没有数据好。如果你的飞机不断坠毁,最好引入某种测量方法,而不是依赖直觉、勇气和一个好的灭火器。
So, amid intense competition, the Wright brothers took a calculated risk. Instead of plowing ahead with more flight tests, they holed up in the back of their bike shop in Dayton, Ohio. There they spent months building a precise measurement tool for wing performance. The result was a small gasoline motor creating constant airflow through an adjacent six-foot-long wooden chamber: a wind tunnel. Using their wind tunnel, the Wrights could quickly measure one wing design after another, precisely ascertaining how much lift and drag each airfoil produced. Of course, their measurements of the performance of airfoils in a wind tunnel were a simplification, an imperfect simulacrum of the actual performance of an actual wing on an actual plane in actual flight. But, they reasoned, data is better than no data. If your aeroplanes keep crashing, it’s better to introduce some sort of measurement than to rely on intuition, moxie, and a good fire extinguisher.
事实证明,他们的大胆举动至关重要,不仅使他们完善了理论,还对其进行了超越。正如威尔伯·莱特后来回忆的那样:
Their bold move turned out to be crucial, enabling them both to patch up the theory and to go beyond it. As Wilbur Wright later recalled:
我们很难低估在自制风洞中辛勤工作的价值。奥维尔和我把所有数据都记录成表格,最终打造出一架精准可靠的机翼。尽管我们的“飞行者”及其控制系统声名鹊起,但如果我们没有开发自己的风洞并得出正确的气动数据,这一切都不可能实现。
It is difficult to underestimate the value of that very laborious work we did over that homemade wind tunnel. From all the data that Orville and I accumulated into tables, an accurate and reliable wing could finally be built. As famous as we became for our “Flyer” and its system of control, it all would never have happened if we had not developed our own wind tunnel and derived our own correct aerodynamic data.
事实证明,莱特兄弟的风洞虽然简单,却并没有简单到无法捕捉优秀机翼设计的关键要素。在他们的风洞里,莱特兄弟可以精确地逐一测量各个翼型的性能。根据测量数据,他们打造了一个高度优化的机翼,并将其安装到飞机上。1903年12月17日早晨,他们飞入了史册。
It turned out that the Wrights’ wind tunnel—albeit simple—was not too simple to capture the important aspects of what made for a good wing design. In their tunnel, the brothers could precisely measure the performance of one airfoil after another. Based on the resulting data, they built a highly optimized wing and slapped it onto a plane. On the morning of December 17, 1903, they entered history, flying.
如果我们想了解名声,我们需要一个风洞。
If we want to understand fame, what we need is a wind tunnel.
名气的很多方面都难以衡量。比如失去匿名性,聚光灯下的压力,以及目睹明星光环消逝带来的心理冲击。
Many aspects of fame are difficult to measure. The loss of anonymity. The pressure of the spotlight. The psychological impact of watching your star wane.
但名气之大又如何呢?那种感觉耶稣比列侬更出名,列侬比鲍德温更出名,鲍德温比小林更出名的感觉又如何呢?或许,这里还有希望。毕竟,名气大小的一个重要方面是人们提及你的频率。而人们提及你的频率的一个重要方面是人们在书中提及你的频率。而说到书中人物的提及频率——嗯,ngrams 真的可以派上用场。
But what about the bigness of fame—that sense that Jesus is more famous than Lennon, who is more famous than Baldwin, who is more famous than Kobayashi? Here, perhaps, there is hope. After all, an important aspect of the magnitude of fame is how frequently people mention you. And an important aspect of how frequently people mention you is how frequently people mention you in books. And when it comes to mentions of people in books—well, ngrams can really come in handy for that.
当然,我们用ngrams衡量的并非名气本身,而是一种简化,一种名气的摹本。我们暂且称之为“phame”。问题是,phame是否足够接近名气,可以作为我们的风洞?
Of course, what we measure with ngrams is not fame itself but a simplification, a fame facsimile. Let’s call it “phame” for now. The question is, does phame resemble fame well enough to serve as our wind tunnel?
让我们从英国最著名的作家之一查尔斯·狄更斯开始探讨这个问题。他的第一部小说《匹克威克外传》始于1836年,最初是连载作品,也就是以一系列小篇幅的形式在期刊上出版的书。随着《匹克威克外传》的出版,2-gram“Charles Dickens”在书籍记录中开始加速攀升。就像莱特兄弟著名的“飞行者”一样,狄更斯的 phame 持续高涨,他创作了一系列畅销书,包括《雾都孤儿》(1837年)、《圣诞颂歌》(1843年)、《大卫·科波菲尔》(1849年)、《双城记》(1859年)和《远大前程》(1860年)。这些作品的文化影响力巨大。据说《圣诞颂歌》让“圣诞快乐”(Merry Christmas)这个问候语流行起来,这一说法与 ngram 数据一致。
Let’s start exploring this question with a look at Charles Dickens, one of England’s most famous writers. His first novel, The Pickwick Papers, began in 1836 as a serial—a book published in a periodical as a series of small parts. With the publication of The Pickwick Papers, the 2-gram Charles Dickens begins to pick up speed in the book record. Like the Wright brothers’ famous Flyer, Dickens’ phame just kept on rising as he produced a steady stream of bestsellers, including Oliver Twist (1837), A Christmas Carol (1843), David Copperfield (1849), A Tale of Two Cities (1859), and Great Expectations (1860). The cultural impact of these works was enormous. It is said that A Christmas Carol popularized the greeting Merry Christmas, a report that is consistent with the ngram data.
和狄金森一样,狄更斯1870年的去世并没有让他的 phame 衰落。相反,他的 phame 一路飙升,因为他去世的消息引发了人们对他才华的全新认识。在他去世后的几十年里,他被提及的频率达到了顶峰。但到了1900年,2-gram“Charles Dickens”已经开始缓慢下滑。尽管狄更斯即使在今天仍然极为 phamous,是学者们深入研究的对象,也是高中课程的必修内容,但他的 phame 显然正在衰落。这种情况已经持续了一个多世纪。
As with Dickinson, Dickens’ death in 1870 did not cause his phame to ebb. Instead, it skyrocketed, as word of his passing brought on a newfound appreciation for his genius. In the decades after his death, his frequency of mention reached its very peak. But by 1900, the 2-gram Charles Dickens had begun a slow decline. Despite being extraordinarily phamous even today, the subject of intense scholarly examination, and a staple of high school curricula, Dickens’ phame is plainly on the wane. It has been for over a century.
将查尔斯·狄更斯放入我们的风洞,产生了有趣的结果:对狄更斯的成就所引起的公众兴趣的一种合理衡量。
Putting Charles Dickens into our wind tunnel produced interesting results—a plausible measurement of the public interest that resulted from Dickens’ achievements.
但前景并非完全乐观。我们的例子也凸显了一些重要的方面:用书籍衡量的 phame,与反映在我们对文化重要性的直觉观念中的名望(fame),并不总是相处融洽。所有测量工具都会出错。为了更好地理解这里发生的事情,了解一些误差分析理论会有所帮助。误差分析理论是统计学中一个成熟的分支,它研究测量中所有可能出错的方式。
But the outlook is not completely rosy. Our example also helps highlight some of the important ways in which phame, as measured using books, and fame, as reflected in our intuitive notions of cultural importance, don’t always get along famously. All measurement devices make mistakes. To better understand what’s going on here, it helps to know a little bit about the theory of error analysis, a well-developed branch of statistics that deals with all the ways in which a measurement can go wrong.
统计学家区分了测量设备可能产生的两种误差。第一种称为随机误差:即使测量对象没有变化,也会出现波动。我们可以在 phame 曲线中看到这种误差,它们以小峰和小谷的形式出现,虽然普遍存在,但通常没有意义。随机误差的好处在于,尽管曲线会波动,但它通常接近真实值。
Statisticians distinguish between two types of error that a measurement device can make. The first type is called random error: fluctuations that occur even if what is being measured is not changing. We can see such errors in the form of small peaks and valleys in phame, which, though ubiquitous, are often not meaningful. The good thing about random error is that, although the curve wiggles around, it typically stays close to the true value.
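A small simulation can illustrate why random error is comparatively benign. This is a sketch under stated assumptions, not the authors' analysis: the "true" frequency, the Gaussian noise model, and the helper names `noisy_measurements` and `moving_average` are all invented for illustration.

```python
import random
from statistics import mean

TRUE_FREQUENCY = 10.0  # hypothetical true value; the units are arbitrary


def noisy_measurements(true_value, n, noise_sd, seed=0):
    """Simulate n measurements of a constant quantity with random error."""
    rng = random.Random(seed)
    return [true_value + rng.gauss(0, noise_sd) for _ in range(n)]


def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


yearly = noisy_measurements(TRUE_FREQUENCY, n=1000, noise_sd=1.0)
smoothed = moving_average(yearly, window=25)

# The curve wiggles, but its average stays near the true value,
# and smoothing shrinks the largest deviation.
print(round(mean(yearly), 2))
print(max(abs(v - TRUE_FREQUENCY) for v in yearly))
print(max(abs(v - TRUE_FREQUENCY) for v in smoothed))
```

Because the fluctuations have no preferred direction, averaging over more data pulls the estimate toward the true value, and the largest wiggle in the smoothed series is much smaller than in the raw one.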
所谓的系统性误差更加棘手。这些误差通常会使测量结果向某个方向倾斜,要么抬高要么压低。例如,我们测量 phame 的程序是搜索某人姓名出现的实例。但这只能捕获所有提及中的一小部分。如果我们追踪“Charles Dickens”的频率,就会错过人们仅称他为“狄更斯”“查理”或“C-Money”的情况。如果他们称他为“《匹克威克外传》的作者”或“凯瑟琳·霍加斯的丈夫”,我们也不会发现。当然,如果有人通过引用一段喜爱的段落,或称赞魔术师大卫·科波菲尔的魔术,甚至只是使用“圣诞快乐”这个短语来提及狄更斯的遗产,我们同样会错过。
So-called systematic errors are trickier. These errors typically skew the measurement in a given direction, either inflating or reducing it. For instance, our procedure for measuring phame is to search for instances of a person’s name. But this captures only a small fraction of all references. If we’re tracking the frequency of Charles Dickens, we miss cases in which people refer to him as just “Dickens” or “Charlie” or “C-Money.” If they refer to him as “the author of The Pickwick Papers” or “the husband of Catherine Hogarth,” we won’t catch it, either. And of course, if someone makes a reference to Dickens’ legacy by quoting a favorite passage, or admiring a trick by illusionist David Copperfield, or even just using the phrase Merry Christmas, we miss that, too.
要说明捕捉每一处对狄更斯的提及有多难,有一个很好的例子:2011年,竞选共和党全国委员会主席的迈克尔·斯蒂尔在一场电视辩论中被要求说出他最喜欢的书。斯蒂尔的回答是一次令人尴尬的失言:“《战争与和平》……最好的时代和最坏的时代。”这句引文是被说错了的狄更斯名句,出自《双城记》。但《战争与和平》的作者却是列夫·托尔斯泰。斯蒂尔指的到底是不是狄更斯?
A great example of the difficulty involved in catching every single Dickens reference occurred when Michael Steele, running for chair of the Republican National Committee, was asked to name his favorite book during a televised debate in 2011. Steele’s answer was an embarrassing gaffe: “War and Peace . . . the best of times and the worst of times.” The quote is mangled Dickens, from A Tale of Two Cities. But War and Peace is by Leo Tolstoy. Was Steele referring to Dickens or wasn’t he?
这类错误(我们遗漏了理想情况下想要捕捉到的内容)属于系统性误差中的一类,统计学家称之为“假阴性”。由于这些假阴性,我们报告的 phame 通常远低于提及该人的真实频率。
These types of errors—when we neglect something we’d ideally want to catch—are a class of systematic error that statisticians call a false negative. As a result of our false negatives, the phame we report is typically much lower than the true frequency of references to a person.
还有一种系统性错误,称为假阳性。当我们统计了一些不该统计的数字时,就会发生这种情况。写下“查尔斯·狄更斯”几个字的人,实际上可能指的是狄更斯的长子,作家小查尔斯·狄更斯;他的孙子杰拉尔德·查尔斯·狄更斯;他的两个曾孙,塞德里克·查尔斯·狄更斯和彼得·杰拉尔德·查尔斯·狄更斯;或者他的玄孙,演员杰拉尔德·查尔斯·狄更斯。Phame 把这一切都归咎于家族族长。但统计学家知道这可能很危险。没有一位统计学家比加州大学伯克利分校的迈克尔·I·乔丹教授更深刻地理解这个问题。要了解原因,请谷歌搜索迈克尔·乔丹统计数据。
There is another type of systematic error, called a false positive. This occurs when we count something that we really should not. Someone writing the words Charles Dickens may in fact be referring to Dickens’ eldest son, the author Charles Dickens, Jr.; his grandson Gerald Charles Dickens; two of his great-grandsons, Cedric Charles Dickens and Peter Gerald Charles Dickens; or his great-great-grandson, the actor Gerald Charles Dickens. Phame chalks it all up to the family patriarch. But statisticians know that this can be perilous. No statistician understands this issue more deeply than a professor at UC Berkeley named Michael I. Jordan. To see why, Google Michael Jordan statistics.
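The two kinds of systematic error can be counted directly on a toy example. In this sketch the sentences and their labels are invented for illustration, and the function name `phame_errors` is an assumption; the only thing taken from the text is the procedure itself, which counts exact occurrences of the name.

```python
# Hypothetical labelled sentences: (text, truly_about_the_novelist).
MENTIONS = [
    ("Charles Dickens published Oliver Twist in 1837.", True),
    ("Dickens toured America in 1842.", True),
    ("The author of The Pickwick Papers died in 1870.", True),
    ("Charles Dickens, Jr. edited a dictionary of London.", False),
    ("The weather in Kent was mild that year.", False),
]


def phame_errors(mentions, name="Charles Dickens"):
    """Return (counted, false_negatives, false_positives) for exact name matching."""
    counted = [name in text for text, _ in mentions]
    # False negative: a real reference that the name search misses.
    false_neg = sum(1 for hit, (_, real) in zip(counted, mentions) if real and not hit)
    # False positive: a match that is not really about the novelist.
    false_pos = sum(1 for hit, (_, real) in zip(counted, mentions) if hit and not real)
    return sum(counted), false_neg, false_pos


print(phame_errors(MENTIONS))  # (2, 2, 1)
```

Here the name search counts two mentions while three sentences truly refer to the novelist: two real references are missed, and one match belongs to his son.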
但我们尚未触及我们的技术所引发的最复杂的统计问题。
But we’ve yet to broach the most complex statistical issue raised by our technique.
想想 1936 年。许多名人都出生于 1936 年。其中两位是罗伯特·雷德福和瓦茨拉夫·哈维尔。
Consider the year 1936. Many famous people were born in 1936. Two of them are Robert Redford and Václav Havel.
罗伯特·雷德福是典型的好莱坞明星。过去五十年来,他扮演了众多标志性角色,在电影中的精彩表演激励了数亿观众,例如《走出非洲》、《骗中骗》和《总统班底》。他粗犷英俊的外表使他成为美国最受喜爱的文化人物之一,闻名世界。
Robert Redford is the quintessential Hollywood star. He has played iconic roles in films for the last five decades, inspiring hundreds of millions of people with his performances in movies like Out of Africa, The Sting, and All the President’s Men. His rugged good looks have made him one of America’s best-loved cultural figures, known the world over.
瓦茨拉夫·哈维尔是一位与众不同的名人。他是一位低调的剧作家,在天鹅绒革命期间带领捷克斯洛伐克脱离共产主义,并成为其首任总统。四年后,他主持了捷克共和国和斯洛伐克共和国的和平分离。哈维尔是二十世纪最著名的政治和文学人物之一。
Václav Havel is a different breed of celebrity. He was a quiet playwright who led Czechoslovakia out of communism during the Velvet Revolution, becoming its first president. Four years later, he presided over the peaceful separation of the Czech and Slovak republics. Havel is one of the most famous political and literary figures of the twentieth century.
他们俩都位列1936年出生的十大 phamous 人物之中。但他们都没能登上榜首。那么,1936年出生的人中谁最 phamous 呢?是一位名叫卡罗尔·吉利根的女性。
Both of them are among the ten most phamous people born in 1936. But they are edged out for the spot at the top of the list. Who, then, is the most phamous person born in 1936? A woman named Carol Gilligan.
吉利根是一位著名的心理学家和杰出的女权主义者,她开创性的工作使她先后获得了哈佛大学、剑桥大学以及现在的纽约大学的职位。和平克一样,她也曾入选《时代》杂志的美国最具影响力人物榜单。她是一位知识界的超级巨星。书籍中大量提及卡罗尔·吉利根,比提及瓦茨拉夫·哈维尔或罗伯特·雷德福的次数还要略多一些。如果 phame 和 fame(名声)完全一样,那么最著名的就当属这位博学的女士了。
Gilligan is a renowned psychologist and a prominent feminist, whose groundbreaking work has led to positions at Harvard, Cambridge, and now New York University. Like Pinker, she’s been on Time’s list of the most influential Americans. She is an intellectual superstar. Books mention Carol Gilligan a whole lot, a bit more often than either Václav Havel or Robert Redford. If phame and fame were exactly the same, then the most famous of all would be the scholarly dame.
但让我们现实一点。卡罗尔·吉利根并不比罗伯特·雷德福更出名。她在书中被提及得更多,因为她正是那些写书的人倾向于考虑的那种人:一位科学名人和社会评论家。但她不是那种每天都上头条新闻的人,不是那种形象很可能在公交车旁擦肩而过的人,也不是那种让数百万少女为之倾倒的人。
But let’s get real. Carol Gilligan is not more famous than Robert Redford. She’s talked about more in books, because she’s exactly the type of person that the type of person who writes books tends to think about: a science celebrity and a social critic. But she’s not the type of person who makes headline news every day, not the type of person whose image is likely to pass by on the side of a bus, and not the type of person who makes teenage girls fawn by the millions.
问题在于,phame 无法捕捉到这幅更广阔的图景。如果把电视新闻、小报、网络名人网站以及办公室饮水机旁闲聊中的提及都算上,哈维尔和雷德福的知名度肯定远超吉利根,而且差距不小。吉利根受益于统计学家所谓的“抽样偏差”:phame 所衡量的那一部分文化给了她不公平的优势。她的 phame 高于她的真实名气。
The problem is that phame doesn’t capture this bigger picture. If you were to take into account mentions on TV news, mentions in tabloids, mentions on Internet celebrity sites, and mentions around the office water cooler, Havel and Redford surely eclipse Gilligan, and by no small margin. Gilligan is benefiting from what statisticians call sampling bias—the aspect of culture that phame measures gives her an unfair advantage. She is more phamous than she is famous.
我们的风洞并非完美无缺,但这些缺陷并非个例。它们属于任何测量工具都会出现的典型误差类别,科学家和统计学家几十年来一直在应对这些误差。牢记这些缺陷将有助于未来开发出更好的工具。
Our wind tunnel is not without its flaws. But these faults are not unique. Instead, they fall into classic error categories that arise with any measurement tool and that scientists and statisticians have been dealing with for decades. Bearing these imperfections in mind will make it possible to develop better tools in the future.
phame 与名望之间的关系很好地体现了我们通常的做法。像名望这样日常生活中常见的概念过于复杂,定义也过于模糊,难以量化。因此,我们寻找可以衡量的东西,比如 phame,使其尽可能接近原始概念。最终结果是一个折衷方案,一个名人模仿者,我们可以把它当作实验对象,对其进行仔细的实验。随着更完善的数据集出现,涵盖小报、杂志和学术文章等内容,我们所测量的 phame 将会过时,更复杂的替代方案将会被开发出来。莱特兄弟的风洞与如今用于产生 30 马赫风力以测试新型航天器的 LENS-X 涡轮机相比,也会显得黯然失色。
The relationship between phame and fame is a good illustration of our general approach. An ordinary concept from everyday life, like fame, is too complex and too imprecisely defined to be quantifiable. So we search for things that we can measure, like phame, that are as close to the original concept as possible. The result is a compromise, a celebrity impersonator that we can use as our guinea pig and that we can subject to careful experimentation. As better datasets emerge that incorporate things like tabloids, magazines, and scholarly articles, phame as we measure it will become obsolete, and more sophisticated alternatives will be developed. The Wrights’ wind tunnel would pale in comparison to the LENS-X turbines used today to generate Mach 30 winds for testing new spacecraft.
但就目前而言,phame 已经是一个不错的开始。事实上,好到我们不再纠结于两者的区别;为了简单起见,我们就把一切都叫做“名声”(fame)。几乎出名就足够出名了。
But for now, phame is a pretty good start. So good, in fact, that we’re not going to dwell on the distinction any longer; to keep things simple, we’re just going to call everything fame. Almost famous is famous enough.
有了新的风洞,我们能从中了解到哪些关于一个人“起飞”的空气动力学知识?又能了解到哪些关于他跌回地面的残酷力学原理?
Equipped with our new wind tunnel, what can we learn about the aerodynamics of a person’s takeoff? And about the grim mechanics of the fall back to earth?
当我们开始使用ngram数据研究名气时,我们很快意识到每个故事都是不同的。当我们试图找出规律时,结果似乎难以解释,甚至自相矛盾。我们陷入了数据的无底深渊。
As we began to study fame using the ngram data, we quickly realized that every story was different. When we tried to identify patterns, the results seemed hard to explain and even self-contradictory. We were stuck in a bottomless pit of data.
要了解我们为何陷入困境,我们需要穿越时空回到1930年,来到挪威的一个小镇克里斯蒂安桑。在那里,一位名叫克里斯蒂安·安德沃德的当地医生正在努力理解这场正在摧残他的病人和整个国家的流行病。安德沃德研究的是结核病,这种疾病在挪威肆虐的程度,我们今天或许难以想象。例如,在挪威的特隆赫姆市,1887年至1891年间出生的婴儿中,超过1%在满一岁之前死于结核病。在11至15岁的儿童中,近一半的死亡可归因于这种疾病。
To see why we were stuck, we need to take a trip through time to 1930, to a little town in Norway called Kristiansand. There, a local doctor named Kristian Andvord was struggling to understand the epidemic that was devastating his patients and his nation. Andvord was studying tuberculosis, which afflicted Norway to an extent we might find hard to fathom today. In the Norwegian city of Trondheim, for instance, more than 1 percent of babies born between 1887 and 1891 died of tuberculosis before reaching their first birthday. Among children between the ages of eleven and fifteen, nearly half of all deaths were attributable to the disease.
当时,很容易看出一些奇怪的事情正在发生。随着这场持续数十年的疫情持续蔓延,挪威结核病患者的平均年龄正在上升。这怎么可能呢?
At the time, it was easy to see that something peculiar was going on. As the decades-long epidemic wore on, the average age of Norwegian tuberculosis victims was increasing. How could that be?
安德沃德(或者,根据一个杜撰的故事,是一位与他一起工作的护士)想出了一个主意。他不打算研究整个人口随时间推移的疾病发病率,而是应该将人口分成几个队列,即出生时间大致相同的人群。这种方法的优点在于,通过控制出生年份,他可以更好地解释一些误导性的影响,例如一场饥荒可能只影响了一代儿童。缺点在于,这种方法所需的数据量远远超出了克里斯蒂安桑小镇所能收集到的数据。
Andvord (or, according to an apocryphal story, a nurse working with him) had an idea. Instead of studying disease incidence over time in the entire population, he should break the population up into cohorts, groups of people who were born at roughly the same time. The advantage of this approach was that by controlling for birth year, he could do a far better job of accounting for misleading effects, such as a famine that might have only affected a single generation of children. The disadvantage was that this approach required a lot more data than could possibly be collected in the little town of Kristiansand.
和齐普夫一样,安德沃德也踏上了寻找数据的道路。对安德沃德和医学史来说非常幸运的是,挪威政府在追踪死亡率统计数据方面一直一丝不苟。安德沃德获得了涵盖1896年至1927年整个时期的政府数据,并用来自英格兰、威尔士、丹麦和瑞典的额外数据集对挪威的数据作了补充。有了这些丰富的信息,安德沃德现在可以提出并回答那些以前困扰他的简单问题了。例如,1900年出生的人(即1900年队列)在什么年龄最有可能死于结核病?1910年队列呢?1920年队列呢?
Like Zipf, Andvord hit the road on a quest for data. To the great fortune of Andvord and of medical history, the Norwegian government had been meticulous in its efforts to track mortality statistics. Andvord was able to get government data covering the entire period from 1896 through 1927. He supplemented the Norway results with additional datasets from England, Wales, Denmark, and Sweden. Armed with this wealth of information, Andvord could now ask and answer the simple questions that had stymied him before. For instance, at what age were the people born in 1900 (the 1900 cohort) most likely to die of tuberculosis? What about the 1910 cohort? What about the 1920 cohort?
他得到的答案令人震惊。原来,无论出生年份如何,患者最容易在5至14岁或20至24岁之间感染结核病。安德沃德的队列分析显示,结核病主要是一种年轻人的疾病,而且一直以来都是如此。
The answers he obtained were astonishing. It turned out that, regardless of their year of birth, disease victims were most likely to contract tuberculosis between the ages of five and fourteen or between the ages of twenty and twenty-four. Andvord’s cohort analysis revealed that tuberculosis was primarily a disease of the young, and had been all along.
但如果真是这样,那么纵观整个人口,结核病患者的平均年龄怎么会随着时间的推移而上升呢?安德沃德在研究该疾病的总发病率时获得了关键的洞见。总发病率指的是某一队列的成员在一生中的某个阶段(无论年轻还是年老)死于结核病的可能性。随着安德沃德研究的队列越来越年轻,他发现总发病率越来越低。1920 年出生的挪威人一生中感染结核病的可能性小于 1910 年出生的挪威人,而 1910 年出生的挪威人感染结核病的可能性又小于 1900 年出生的挪威人,依此类推。
But if so, how could it be that if one looked at the entire population, the average age of tuberculosis victims was increasing over time? The crucial insight came when Andvord examined the total incidence of the disease—the likelihood that a member of a particular cohort would die of tuberculosis at some point in their life, young or old. As Andvord examined younger and younger cohorts, he found that the total incidence got lower and lower. Norwegians born in 1920 were less likely to contract tuberculosis in their lifetimes than Norwegians born in 1910, who were in turn less likely to contract tuberculosis than Norwegians born in 1900, and so on.
这让关于年龄的普遍发现有了不同的解读。并非这种疾病的目标人群越来越老,而是出生较早的人终其一生都更容易感染结核病。这些发现的直接推论是医学界的一枚重磅炸弹:挪威年轻人对结核病的抵抗力正在逐渐增强。这场疫情就像一场残酷却有效的大规模疫苗接种运动。
This cast the common finding about age in a different light. It wasn’t that the disease was targeting increasingly old people; it was that people born earlier were more vulnerable to contracting tuberculosis throughout their whole lives. The immediate consequence of these findings was a medical bombshell: Young Norwegians were becoming progressively more resistant to tuberculosis. The epidemic was functioning as a murderous but effective mass vaccination campaign.
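The apparent paradox can be reproduced with a toy model. The sketch below is only an illustration: the age profile, the decline rate, and the year the decline starts are invented numbers, not Andvord's data, and all cohorts are assumed to be the same size. Every cohort shares one age profile of risk, but cohorts born after a hypothetical turning point have a lower lifetime incidence.

```python
import math

MAX_AGE = 80  # ages 0..79 are modeled


def age_profile(age):
    """Hypothetical relative risk at a given age, identical for every cohort.

    The shape age * exp(-age / 10) peaks at age 10 (an illustrative choice).
    """
    return age * math.exp(-age / 10)


def cohort_incidence(birth_year, decline_start=1850, rate=0.03):
    """Hypothetical lifetime incidence: flat for cohorts born up to
    decline_start, then shrinking exponentially with each later birth year."""
    if birth_year <= decline_start:
        return 1.0
    return math.exp(-rate * (birth_year - decline_start))


def deaths(calendar_year, age):
    """Relative number of victims of a given age in a given calendar year."""
    return cohort_incidence(calendar_year - age) * age_profile(age)


def average_victim_age(calendar_year):
    """Cross-sectional view: mean age of all victims in one calendar year."""
    weights = [deaths(calendar_year, a) for a in range(MAX_AGE)]
    return sum(a * w for a, w in enumerate(weights)) / sum(weights)


def peak_age_of_cohort(birth_year):
    """Cohort view: the age at which one birth cohort is hit hardest."""
    return max(range(MAX_AGE), key=lambda a: deaths(birth_year + a, a))


# Cohort view: the hardest-hit age is the same for every cohort.
print(peak_age_of_cohort(1800), peak_age_of_cohort(1900))
# Cross-sectional view: the average victim gets older once younger cohorts
# start to be less afflicted than older ones.
print(average_victim_age(1850), average_victim_age(1930))
```

In this toy setting the disease never changes its preferred age, yet the average age of victims in the whole population rises, simply because the older people alive in a later year belong to more heavily afflicted cohorts.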
尽管完全出乎意料,安德沃德的惊人结论最终被证明是正确的。这并非他唯一的遗产。安德沃德的队列研究方法是一项革命性的洞见,如今已成为流行病学和公共卫生领域不可或缺的科学工具。凡是需要汇总海量公共卫生数据集的地方,安德沃德的思想都可能在发挥作用。正是多亏了安德沃德(或者可能是他的护士),我们才了解到高血压与心血管疾病之间的关联、吸烟与肺癌之间的关联、血糖与糖尿病之间的关联,以及数以万计的其他关联,这些关联让我们的每一个饮食决定都充满负罪感。
Though completely unexpected, Andvord’s astonishing conclusions proved to be correct. This was not his only legacy. Andvord’s cohort method was a revolutionary insight that has become an essential scientific tool for epidemiology and public health. Wherever massive datasets about public health are being aggregated, Andvord’s ideas are likely at work. It is to Andvord (or, possibly, his nurse) that we are indebted for such knowledge as the association between high blood pressure and cardiovascular disease, the association between cigarette smoking and lung cancer, the association between blood sugar and diabetes, and tens of thousands of other associations which ensure that our every dietary decision is riddled with guilt.
与结核病研究一样,名气研究也因各种代际效应而受阻。例如,互联网的发明极大地影响了人们如何成为名人。在我们最初的研究中,这些代际效应使得我们很难看清究竟发生了什么。
Like studies of tuberculosis, studies of fame are confounded by all sorts of generation-specific effects. For instance, the invention of the Internet has dramatically influenced how people become celebrities. In our initial research, these generation-specific effects made it extremely hard to see what was going on.
最后,我们做了任何一位优秀的数据科学家都应该做的事情。我们问自己:“安德沃德会怎么做?”(WWAD?)突然间,解决方案清晰起来。我们应该使用群组方法。我们应该像对待疾病一样对待名声。
Finally, we did what any good data scientist should have done in the first place. We asked ourselves, WWAD? (“What Would Andvord Do?”) Suddenly, the solution became clear. We should use the cohort method. We should treat fame like a disease.
当时,我们刚认识 Adrian Veres(阿德里安)。阿德里安是一位非常出色的本科生,他对不朽的名声颇有体会:因为在英特尔国际科学与工程大奖赛上获得第一名,已经有一颗小行星以他的名字命名,即 21758 Adrianveres。
At the time, we had just met Adrian Veres. A truly stellar undergraduate, Adrian knew a thing or two about immortal fame: For winning first place in the Intel International Science and Engineering Fair, he had already had a minor planet named after him, 21758 Adrianveres.
我们与阿德里安合作,开始创建队列,其中包括每一代受名声“感染”最严重的人:吐温们、甘地们、罗斯福们。我们选择研究出生于1800年至1950年之间的人。再早,我们就会涉足数据集中数据质量欠佳的部分。再晚,我们就无法在足够长的时间内追踪名气:1950 年出生的人往往要到 80 年代或 90 年代才会出名,留给我们的可用数据只有寥寥几年。阿德里安分析了数十万人,计算了他们全名(例如 Mark Twain)被提及的频率。对于 1800 年至 1950 年间的每一年,他都列出了当年出生的 50 位最著名的人。考虑到阿德里安在他自己的星球上才刚满 6 岁,这项工作尤其令人印象深刻。如果名气是一种疾病,那么阿德里安的名单就包含了 7,500 名病情最重的患者。
Working with Adrian, we began to create cohorts consisting of the individuals in each generation who were the most severely afflicted by fame: the Twains, the Gandhis, the Roosevelts. We chose to study people born between 1800 and 1950. Earlier, and we would be wading into parts of the dataset where our data quality was not at its very best. Later, and we would be unable to track fame across a sufficiently long period: Someone born in 1950 frequently won’t get famous until the ’80s or ’90s, giving us only a handful of years’ worth of usable data. Adrian analyzed hundreds of thousands of people, computing the frequency of mention of their full names (for example, Mark Twain). For each year between 1800 and 1950, he generated a list of the fifty most famous people born that year. It was particularly impressive work, considering that Adrian had just turned six on his home planet. If celebrity is a disease, Adrian’s lists contained its 7,500 worst victims.
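The cohort-building step described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the function name `top_per_cohort`, the record layout, and the sample people and frequencies are all hypothetical.

```python
from collections import defaultdict


def top_per_cohort(people, top_n=50):
    """Group people by birth year and keep the top_n most mentioned per year.

    people: iterable of (full_name, birth_year, mention_frequency) tuples.
    Returns a dict mapping birth_year to a list of names, most famous first.
    """
    cohorts = defaultdict(list)
    for name, birth_year, frequency in people:
        cohorts[birth_year].append((name, frequency))

    result = {}
    for birth_year, members in cohorts.items():
        ranked = sorted(members, key=lambda member: member[1], reverse=True)
        result[birth_year] = [name for name, _ in ranked[:top_n]]
    return result


# Made-up records, for illustration only.
sample = [
    ("Person A", 1871, 3.0e-7),
    ("Person B", 1871, 9.0e-7),
    ("Person C", 1871, 1.0e-8),
    ("Person D", 1872, 5.0e-7),
]

print(top_per_cohort(sample, top_n=2))
# 150 birth years times 50 people per year gives the 7,500 names in the text.
print(150 * 50)
```

The first name in each list plays the role of that year's valedictorian.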
这些队列中的人物令人兴奋,他们揭示了通往成名的多种途径。以1871年出生的那一队列(或者说那一届)为例。1871年出生的50位最著名的人中包括我们的灵感来源奥维尔·莱特,他因学会飞行而成名。欧内斯特·卢瑟福因其非凡的散射实验而闻名,该实验揭示了原子核的存在。马塞尔·普鲁斯特则因写出好书而闻名。
The groups were an exciting set of people who revealed the many diverse paths to fame. Take the cohort, or class, of 1871. The fifty most famous people born in 1871 included Orville Wright, our inspiration, who became famous when he learned how to fly. Ernest Rutherford became famous for his remarkable scattering experiments, which revealed the existence of the atomic nucleus. And Marcel Proust became famous for writing good books.
这一届的毕业生代表(即1871年出生的人中最著名的一位)是科德尔·赫尔。没听说过他?如今他的知名度已大不如前,但在巅峰时期,赫尔可谓是一位举足轻重的人物。赫尔曾担任美国参议员,最终成为任职时间最长的国务卿。他在富兰克林·德拉诺·罗斯福手下任职十一年,其间经历了二战最激烈的时期。此外,赫尔在联合国的创立中发挥了巨大作用,并因此荣获诺贝尔和平奖。罗斯福本人甚至称赫尔为“联合国之父”。这位班级第一名确实名副其实。
The class valedictorian—the most famous person born in 1871—was Cordell Hull. Never heard of him? He’s much less known now, but in his heyday, Hull was a titanic figure. A United States senator, Hull eventually became the longest-serving secretary of state. His eleven years under Franklin Delano Roosevelt spanned the height of World War II. Among other things, Hull played a huge role in founding the United Nations, for which he was honored with the Nobel Peace Prize. Roosevelt himself referred to Hull as “the Father of the United Nations.” The head of the class really made good.
每个班级都汇集了类似的精彩人生故事。1904届包括智利诗人巴勃罗·聂鲁达、超现实主义画家萨尔瓦多·达利,以及领导曼哈顿计划、制造出第一颗原子弹的罗伯特·奥本海默。这一届的毕业生代表是中国领导人邓小平。1899届的毕业生代表是欧内斯特·海明威;这一届还包括阿根廷作家豪尔赫·路易斯·博尔赫斯、演员弗雷德·阿斯坦和亨弗莱·鲍嘉、标志性导演阿尔弗雷德·希区柯克以及黑帮老大阿尔·卡彭。这样一场同学聚会晚宴的邀请,绝对让人难以拒绝。
Each and every class comprises a similar pastiche of fascinating life stories. The class of 1904 includes the Chilean poet Pablo Neruda, the Surrealist painter Salvador Dalí, and Robert Oppenheimer, leader of the Manhattan Project, which built the first atom bomb. Its valedictorian is Deng Xiaoping, the Chinese leader. The 1899 valedictorian is Ernest Hemingway; that class includes the Argentine writer Jorge Luis Borges, the actors Fred Astaire and Humphrey Bogart, the iconic director Alfred Hitchcock, and the gangster Al Capone. An invitation to that reunion dinner would be an offer you can’t refuse.
下表列出了150位毕业生代表。看看你认识多少个名字。你可以把这看作是你参加过的最客观的历史考试。这些名字并不代表我们对你应该了解哪些人的看法,也不代表老师、教授或世界历史学术权威的意见。相反,它们反映了自1800年以来所有用英语写过书的人的总体看法。
The 150 valedictorians are listed in the table that follows. See how many of the names you recognize. You can think of this as the most objective history test you’ll ever take. These names don’t reflect our opinion of whom you should know about, or the opinion of a teacher or professor or scholarly authority on world history. Instead, they reflect the aggregate opinion of everyone who has written a book in English since 1800.
出生年份 Birth year | 毕业生代表 Valedictorian
1800 | 乔治·班克罗夫特 George Bancroft
1801 | 杨百翰 Brigham Young
1802 | 维克多·雨果 Victor Hugo
1803 | 拉尔夫·沃尔多·爱默生 Ralph Waldo Emerson
1804 | 乔治·桑 George Sand
1805 | 威廉·劳埃德·加里森 William Lloyd Garrison
1806 | 约翰·斯图尔特·密尔 John Stuart Mill
1807 | 路易斯·阿加西 Louis Agassiz
1808 | 拿破仑三世 Napoleon III
1809 | 亚伯拉罕·林肯 Abraham Lincoln
1810 | 利奥十三世 Leo XIII
1811 | 霍勒斯·格里利 Horace Greeley
1812 | 查尔斯·狄更斯 Charles Dickens
1813 | 亨利·沃德·比彻 Henry Ward Beecher
1814 | 查尔斯·里德 Charles Reade
1815 | 安东尼·特罗洛普 Anthony Trollope
1816 | 拉塞尔·塞奇 Russell Sage
1817 | 亨利·戴维·梭罗 Henry David Thoreau
1818 | 卡尔·马克思 Karl Marx
1819 | 乔治·艾略特 George Eliot
1820 | 赫伯特·斯宾塞 Herbert Spencer
1821 | 玛丽·贝克·艾迪 Mary Baker Eddy
1822 | 马修·阿诺德 Matthew Arnold
1823 | 戈德温·史密斯 Goldwin Smith
1824 | 石墙杰克逊 Stonewall Jackson
1825 | 贝亚德·泰勒 Bayard Taylor
1826 | 沃尔特·白芝浩 Walter Bagehot
1827 | 查尔斯·艾略特·诺顿 Charles Eliot Norton
1828 | 乔治·梅雷迪斯 George Meredith
1829 | 卡尔·舒尔茨 Carl Schurz
1830 | 艾米莉·狄金森 Emily Dickinson
1831 | 坐牛 Sitting Bull
1832 | 莱斯利·斯蒂芬 Leslie Stephen
1833 | 埃德温·布斯 Edwin Booth
1834 | 威廉·莫里斯 William Morris
1835 | 马克·吐温 Mark Twain
1836 | 布雷特·哈特 Bret Harte
1837 | 格罗弗·克利夫兰 Grover Cleveland
1838 | 约翰·莫利 John Morley
1839 | 亨利·乔治 Henry George
1840 | 疯马 Crazy Horse
1841 | 爱德华七世 Edward VII
1842 | 阿尔弗雷德·马歇尔 Alfred Marshall
1843 | 亨利·詹姆斯 Henry James
1844 | 阿纳托尔·法朗士 Anatole France
1845 | 伊莱胡·鲁特 Elihu Root
1846 | 布法罗比尔 Buffalo Bill
1847 | 艾伦·特里 Ellen Terry
1848 | 格兰特·艾伦 Grant Allen
1849 | 埃德蒙·戈斯 Edmund Gosse
1850 | 罗伯特·路易斯·史蒂文森 Robert Louis Stevenson
1851 | 奥利弗·洛奇 Oliver Lodge
1852 | 布兰德·马修斯 Brander Matthews
1853 | 塞西尔·罗兹 Cecil Rhodes
1854 | 奥斯卡·王尔德 Oscar Wilde
1855 | 乔赛亚·罗伊斯 Josiah Royce
1856 | 伍德罗·威尔逊 Woodrow Wilson
1857 | 庇护十一世 Pius XI
1858 | 西奥多·罗斯福 Theodore Roosevelt
1859 | 约翰·杜威 John Dewey
1860 | 简·亚当斯 Jane Addams
1861 | 泰戈尔 Rabindranath Tagore
1862 | 爱德华·格雷 Edward Grey
1863 | 大卫·劳合·乔治 David Lloyd George
1864 | 马克斯·韦伯 Max Weber
1865 | 拉迪亚德·吉卜林 Rudyard Kipling
1866 | 拉姆齐·麦克唐纳 Ramsay MacDonald
1867 | 阿诺德·贝内特 Arnold Bennett
1868 | 威廉·艾伦·怀特 William Allen White
1869 | 安德烈·纪德 André Gide
1870 | 弗兰克·诺里斯 Frank Norris
1871 | 科德尔·赫尔 Cordell Hull
1872 | 室利·奥罗宾多 Sri Aurobindo
1873 | 艾尔·史密斯 Al Smith
1874 | 温斯顿·丘吉尔 Winston Churchill
1875 | 托马斯·曼 Thomas Mann
1876 | 庇护十二世 Pius XII
1877 | 伊莎多拉·邓肯 Isadora Duncan
1878 | 卡尔·桑德堡 Carl Sandburg
1879 | 阿尔伯特·爱因斯坦 Albert Einstein
1880 | 道格拉斯·麦克阿瑟 Douglas MacArthur
1881 | 皮埃尔·特伊哈德·德·夏尔丹 Pierre Teilhard de Chardin
1882 | 弗吉尼亚·伍尔夫 Virginia Woolf
1883 | 威廉·卡洛斯·威廉姆斯 William Carlos Williams
1884 | 哈里·杜鲁门 Harry Truman
1885 | 埃兹拉·庞德 Ezra Pound
1886 | 范威克·布鲁克斯 Van Wyck Brooks
1887 | 鲁珀特·布鲁克 Rupert Brooke
1888 | 约翰·福斯特·杜勒斯 John Foster Dulles
1889 | 贾瓦哈拉尔·尼赫鲁 Jawaharlal Nehru
1890 | 胡志明 Ho Chi Minh
1891 | 胡适 Hu Shih
1892 | 莱因霍尔德·尼布尔 Reinhold Niebuhr
1893 | 毛泽东 Mao Zedong
1894 | 阿道司·赫胥黎 Aldous Huxley
1895 | 乔治六世 George VI
1896 | 约翰·多斯·帕索斯 John Dos Passos
1897 | 威廉·福克纳 William Faulkner
1898 | 贡纳尔·默达尔 Gunnar Myrdal
1899 | 欧内斯特·海明威 Ernest Hemingway
1900 | 阿德莱·史蒂文森 Adlai Stevenson
1901 | 玛格丽特·米德 Margaret Mead
1902 | 塔尔科特·帕森斯 Talcott Parsons
1903 | 乔治·奥威尔 George Orwell
1904 | 邓小平 Deng Xiaoping
1905 | 让·保罗·萨特 Jean-Paul Sartre
1906 | 汉娜·阿伦特 Hannah Arendt
1907 | 劳伦斯·奥利弗 Laurence Olivier
1908 | 林登·约翰逊 Lyndon Johnson
1909 | 巴里·戈德华特 Barry Goldwater
1910 | 特蕾莎修女 Mother Teresa
1911 | 罗纳德·里根 Ronald Reagan
1912 | 米尔顿·弗里德曼 Milton Friedman
1913 | 理查德·尼克松 Richard Nixon
1914 | 迪伦·托马斯 Dylan Thomas
1915 | 罗兰·巴特 Roland Barthes
1916 | C. 赖特·米尔斯 C. Wright Mills
1917 | 英迪拉·甘地 Indira Gandhi
1918 | 葛培理 Billy Graham
1919 | 丹尼尔·贝尔 Daniel Bell
1920 | 欧文·豪 Irving Howe
1921 | 雷蒙德·威廉姆斯 Raymond Williams
1922 | 乔治·麦戈文 George McGovern
1923 | 亨利·基辛格 Henry Kissinger
1924 | 吉米·卡特 Jimmy Carter
1925 | 罗伯特·肯尼迪 Robert Kennedy
1926 | 菲德尔·卡斯特罗 Fidel Castro
1927 | 加夫列尔·加西亚·马尔克斯 Gabriel García Márquez
1928 | 切·格瓦拉 Che Guevara
1929 | 小马丁·路德·金 Martin Luther King, Jr.
1930 | 雅克·德里达 Jacques Derrida
1931 | 米哈伊尔·戈尔巴乔夫 Mikhail Gorbachev
1932 | 西尔维娅·普拉斯 Sylvia Plath
1933 | 苏珊·桑塔格 Susan Sontag
1934 | 拉尔夫·纳德 Ralph Nader
1935 | 埃尔维斯·普雷斯利 Elvis Presley
1936 | 卡罗尔·吉利根 Carol Gilligan
1937 | 萨达姆·侯赛因 Saddam Hussein
1938 | 安东尼·吉登斯 Anthony Giddens
1939 | 李·哈维·奥斯瓦尔德 Lee Harvey Oswald
1940 | 约翰·列侬 John Lennon
1941 | 鲍勃·迪伦 Bob Dylan
1942 | 芭芭拉·史翠珊 Barbra Streisand
1943 | 特里·伊格尔顿 Terry Eagleton
1944 | 拉吉夫·甘地 Rajiv Gandhi
1945 | 丹尼尔·奥尔特加 Daniel Ortega
1946 | 比尔·克林顿 Bill Clinton
1947 | 萨尔曼·拉什迪 Salman Rushdie
1948 | 克拉伦斯·托马斯 Clarence Thomas
1949 | 纳瓦兹·谢里夫 Nawaz Sharif
我们很好奇,人们能认出这些逝去的名人吗?于是我们做了一个完全不科学的调查。我们询问了哈佛大学历史系的一位教授,他从150人中认出了116人。我们认识的一位历史系研究生认出了123人;一位记者认出了103人;一位应届大学毕业生认出了73人;一位俄罗斯理论物理学家认出了58人;一位新加坡本科生认出了35人。
We were curious how well people would do at recognizing these, the most famous people of bygone years, so we did a completely unscientific poll. We asked a professor in the department of history at Harvard, who identified 116 of 150. A history grad student we know managed 123; a journalist, 103; a recent college grad, 73; a Russian theoretical physicist, 58; an undergraduate student in Singapore, 35.
虽然每个人认出的名字差别很大,但有些毕业生代表却无人认识,比如1868届的威廉·艾伦·怀特,一位颇具影响力的报纸编辑和重要的进步主义领袖;或者1886届的范·威克·布鲁克斯,一位获得普利策奖的历史学家和马克·吐温的早期传记作者。还记得科德尔·赫尔吗?可惜,只有那位历史教授记得。
Although people varied a great deal in terms of which names they recognized, some valedictorians were unknown to everybody, like 1868’s William Allen White, an influential newspaper editor and an important progressive leader; or 1886’s Van Wyck Brooks, a Pulitzer Prize–winning historian and an early biographer of Mark Twain. Remember Cordell Hull? Sadly, only the history professor did.
在某种程度上,值得注意的是,我们并非人人都能认出其中的每一个名字。我们在高中学习历史时,会了解成千上万个具体的人物。但这些人物反映了一种选择,即课程设置者认定某些人物对我们来说更值得了解,而其他人物则不那么重要。例如,狄金森就受益于文学评论家们在她死后做出的一项决定,他们认为她的作品确实很重要,尽管这些作品在她生前几乎没有产生任何影响。我们赋予做出这种选择的人很大的权力,也就是塑造我们历史观的权力。任何人或任何一小群人是否真的应该拥有这种权力,并不是一目了然的。
On some level, it’s remarkable that we all don’t recognize each and every one of these names. When we study history in high school, we learn about thousands of specific personalities. But those individuals reflect a choice, a decision on the part of whoever creates the curriculum that certain figures are more important for us to know about and others less so. Dickinson, for instance, benefited from a posthumous decision by literary critics who decided that her work really mattered, despite its having made almost no impact during her lifetime. We invest the people who make such choices with a great deal of power—the power to shape our view of history. It’s not immediately obvious that any person, or any small group of people, should really have that power.
另一方面,快速浏览这份名单就能清楚地发现,它也不能成为我们传给子孙后代的历史叙事基础。150位毕业生代表中,只有16位是女性;绝大多数是白人男性。这份名单本身就带有很深的偏见。
On the other hand, it’s clear from a quick look at this list that it, too, cannot be the basis for the historical narratives that we pass on to our children. Of the 150 valedictorians, only 16 are women; the vast majority are Caucasian men. This list has its own deep biases.
谁的错?这一次,错不在写下这份名单的人。我们的名单或许有很多缺陷,但偏重个人观点并非其中之一。我们只是对数字进行了处理。相反,我们所观察到的偏见是这份名单真正作者的集体责任,也就是所有写过书的人。这是历史记录的内在偏见。在某种程度上,它不仅必然反映在我们的名单中,也必然反映在所有历史研究中。无论是像历史学家那样读几十本书,还是像我们一样读数百万本书,我们都是从同一个庞大的收藏中抽样。没有人能免受抽样偏差的影响。历史可能会有偏袒,但统计数据不会。
Who is at fault? For once, it’s not the people who wrote down the list. Our list may have many shortcomings, but emphasis on our personal opinions is not one of them. We just crunched the numbers. Instead, the bias we observe is the collective responsibility of the true authors of the list: anyone who has ever written a book. It is the intrinsic bias of the historical record. And on some level, it must be reflected not only in our list, but in all historical research. Whether one reads books by the dozen, as a historian might, or books by the million, as we do, we’re all sampling from the same massive collection. No one is immune to sampling bias. History may play favorites, but statistics do not.
当然,认为历史记录极度偏颇的认识由来已久。然而,ngram 数据让我们能够开始衡量这种偏见,让我们更清楚地认识到自己做错了什么。如果我们能更好地记住过去的偏见,或许就不会重蹈覆辙。
Of course, the understanding that the historical record is extremely biased is an old one. What ngram data lets us do, however, is start to measure the bias, giving us a clearer picture of what we’re doing wrong. If we can better remember the bias in our past, perhaps we are not condemned to repeat it.
未来,每个人都会有十五分钟的世界闻名。
In the future, everyone will be world-famous for fifteen minutes.
-他叫什么名字
—Whatshisname
安迪·沃霍尔曾敏锐地观察到名声的变幻莫测。但我们认为他统计的数字有误。
Andy Warhol once made a keen observation about the fickle nature of fame. But we think he got the numbers wrong.
让我们用名人堂来揭穿他的错误。近距离观察,这些名人各不相同。有些人是天才少年,有些人是大器晚成。有些人多才多艺,而有些人则专注于自己最擅长的领域。有些人职业生涯漫长,成就斐然,有些人昙花一现。但从远处看,这些差异开始消失,共同点变得更加明显。这就是安德沃德群体研究方法的强大之处。
Let’s use our Fame Hall of Fame to uncover his mistake. Viewed from up close, each of these celebrities is completely different. Some of them grew up as accomplished wunderkinds. Others are late bloomers. Some are multitalented, whereas others stick to what they do best. Some have long careers filled with one achievement after another. Others are one-hit wonders. But from a distance, these differences start to disappear, and shared features become more apparent. This is the great power of Andvord’s cohort method.
当我们观察1871年出生的50位最著名人物(科德尔·赫尔那一届)的平均表现时,一个单一的形状浮现出来,这是1871届如何走红的总体写照。我们也可以对1872届做同样的观察。同样,一个单一的形状浮现出来。值得注意的是,尽管1872届由50个完全不同的人组成,但他们的平均名望曲线的形状几乎完全相同。事实上,在我们研究的150个班级中,每个班级的曲线形状都几乎完全相同。这种形状就是极其知名人士的典型生活轨迹。如果名气是物理学,那么这就是大统一理论。或者至少,它算得上某种理论。
When we look at the average behavior of the fifty most famous people born in 1871 (Cordell Hull’s class), a single shape emerges, an overall portrait of how the class of 1871 got big. We can do the same thing for the class of 1872. Again, a single shape emerges. What’s remarkable is that even though the class of 1872 consists of fifty completely different people, the shape of its average fame curve is almost exactly the same. Indeed, the shape is almost exactly the same for each and every one of the 150 classes we studied. This shape is the typical lifestyle of the extremely famous. If fame were physics, this would be the Grand Unified Theory. Or at least, it would be some sort of theory.
让我们更仔细地看看到底发生了什么。
Let’s take a more careful look at what’s going on.
起初,我们并没有观察到任何迹象:很长一段时间里,这个班级的成员几乎从未在书籍中被提及。这并不奇怪。当十二岁的奥维尔·莱特骑着自行车四处奔波时,没有人写书来记录他那句“有一天他会飞”的宣言。
Initially, no signal is observed: For a long time, members of the class are almost never mentioned in books. That’s no surprise. When twelve-year-old Orville Wright was pedaling around on a bicycle, no one was writing books about young Orville’s pronouncements that one day he was going to fly.
在出生几十年后的某个时刻,这些阶层的成员会首次亮相社交舞台。我们所说的“首次亮相”是指他们的平均频率大于十亿分之一——也就是我们在上一章讨论过的词被收录进字典的截止频率。在我们看来,成名的标准是你的名字值得被收录进字典。
At some point a few decades after their birth, the class members make their debut on the social scene. By debut, we mean that their average frequency is greater than one part per billion—that’s the cutoff frequency for getting a word into the dictionary that we discussed in the previous chapter. By our lights, the standard for being famous is that your name deserves to be in the dictionary.
但他们并非普通的初次登台者。迎接他们到来的,并不是一阵短暂的关注和随后更快的退场。相反,1871届如同其他每一届名人一样,以惊人的能量登上舞台。其成员的名气以非凡的速度增长。每隔几年,他们的平均频率就会翻一番,几十年来一路飙升。用数学语言来说,他们的增长是指数式的,就像病毒性流行病或病毒式传播的视频。在历史的伟大舞台上,他们的表演堪称精彩绝伦。
But these are no ordinary debutantes. Their arrival is not greeted by a quick flurry of interest, followed by a quicker exit. Instead, the class of 1871, like every other class of famous people, bursts onto the scene with tremendous energy. Its members’ fame rises at an extraordinary pace. Every handful of years, its average frequency doubles, skyrocketing over a period of decades. In the language of mathematics, its growth is exponential—like a viral epidemic or a viral video. On the great stage of history, theirs is a bravura performance.
最终,在七十五岁时,1871届达到了巅峰。跨过这个门槛,单从数字上看,他们就开始走下坡路了。接下来对他们来说必定是一段全新的经历,因为这些曾经年轻的风云人物将步入一段持续数百年的缓慢衰落。
Finally, at seventy-five years of age, the class of 1871 reaches its peak. Crossing that threshold, they are, in purely numerical terms, over the hill. What comes next for them must be a new experience, as these once-youthful firebrands enter a slow decline that will last for centuries.
这种曲线形状(首次亮相、指数增长、巅峰、缓慢衰落)在我们研究的所有班级中都普遍存在。但不同班级之间存在细微的差异,这些差异可以用三个参数来描述:首次亮相时的年龄、指数增长的速度以及巅峰后衰落的速度。从数学上讲,还需要第四个参数来描述这条曲线:班级开始走下坡路的年龄。但据我们目前所能测量的,这个年龄似乎变化不大。所有班级都在出生后大约四分之三个世纪达到巅峰。
This shape—debut, exponential growth, peak, and slow decline—is universal across all the classes we studied. But there are subtle variations from class to class, variations that can be described in terms of three parameters: their age at debut, the speed of their exponential growth, and the rate of their post-peak decline. Mathematically, a fourth parameter is needed to describe this curve: the age at which the class is over the hill. But as best we can measure, this doesn’t seem to change much. All classes peak roughly three-quarters of a century after their birth.
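One way to picture the four-parameter shape is as a piecewise exponential. The sketch below is an assumption-laden illustration rather than the model the authors fit: the class name `FameCurve` is hypothetical, the curve is pinned to the one-part-per-billion debut threshold, and the same exponential is extrapolated backward before the debut. The example values are the ones the text reports for the class of 1800 (debut at forty-three, doubling every eight years, peak at seventy-five, half-life of 120 years).

```python
from dataclasses import dataclass

DEBUT_FREQUENCY = 1e-9  # one part per billion, the debut threshold in the text


@dataclass
class FameCurve:
    debut_age: float      # age at which average frequency crosses the threshold
    doubling_time: float  # years for fame to double while it is growing
    peak_age: float       # age at which the class is "over the hill"
    half_life: float      # years for fame to halve after the peak

    def frequency(self, age):
        """Average mention frequency of the class at a given age."""
        if age <= self.peak_age:
            # Exponential growth, passing through the threshold at debut.
            exponent = (age - self.debut_age) / self.doubling_time
            return DEBUT_FREQUENCY * 2 ** exponent
        # Exponential decay, starting from the value reached at the peak.
        peak_value = self.frequency(self.peak_age)
        return peak_value * 0.5 ** ((age - self.peak_age) / self.half_life)


class_of_1800 = FameCurve(debut_age=43, doubling_time=8, peak_age=75, half_life=120)
for age in (43, 75, 195):
    print(age, class_of_1800.frequency(age))
```

Changing `debut_age`, `doubling_time`, and `half_life` while leaving `peak_age` fixed mirrors the class-to-class variation described above.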
让我们来谈谈首次亮相的年龄,也就是一个班级出名到半数成员被提及的频率与词典中的典型词相当时的年龄。对于1800届来说,这发生在43岁。我们心想,这还不错,我们还有时间。但首次亮相的年龄却越来越小。事实上,到20世纪中叶,这个年龄已经下降到29岁了。
Let’s talk about the age at debut, when a class becomes so famous that half of the members are as frequently discussed as the typical word in the dictionary. For the class of 1800, this occurred at age forty-three. Not too bad, we think to ourselves—we’ve still got time. But the age at debut is getting younger and younger. In fact, by the mid–twentieth century, it had declined to twenty-nine years.
值得深思的是,到 29 岁时,1950 届中有一半人已经达到了英文书籍中词典级别的提及频率。这使得他们非常非常出名。
This fact is worth contemplating: By the time they were twenty-nine years old, half of the class of 1950 had reached dictionary-level mention frequencies in English books. Making them really, really famous.
对我们大多数人来说,这都是一个令人警醒的现象。比如,当我们发现这一点时,JB已经28岁了。时间紧迫,年轻的JB仍然充满希望,尽管很明显他最好尽快行动起来。然而,Erez已经30岁了。已经太晚了。
For most of us, this is a phenomenally sobering state of affairs. For instance, when we made this discovery, JB was twenty-eight years old. Just under the wire, there was still hope for young JB, though it was clear that he had better make his move soon. Erez, however, was thirty. It was already too late.
如果你的目标是成为你这一代最有名的人之一,那么这些信息尤其有用。对于我们那些雄心勃勃、十几岁和二十几岁的读者来说,这应该是一个微妙的提醒,提醒他们赶紧行动起来。三十多岁的读者应该意识到他们已经落后了。四十岁以上的读者可能需要一些外部指导。我们将在下一节讨论这个问题。(不要沮丧。有一些策略可以帮助你在黄金岁月中继续保持名声。)
This is particularly useful information if your goal is to be one of the most famous people of your generation. For our ambitious readers in their teens and twenties, this ought to be a subtle reminder to get cracking. Readers in their thirties should be aware that they are already running late. Readers over forty probably warrant some external guidance. We’ll come to that in the next section. (Do not be dismayed. Strategies exist for winning fame well into your golden years.)
人们不仅在更年轻的时候成名,而且名气增长得更快。对于1800届的人来说,名气翻一番大约需要八年时间,从43岁首次亮相到75岁达到巅峰,大约翻了四番。而对于1950届来说,翻番的时间要短得多,大约只需要三年。
Not only do people get famous at a younger age, but their fame grows faster. For the class of 1800, it took about eight years for fame to double, allowing about four doublings between their debut at age forty-three and their peak at age seventy-five. For the class of 1950, the doubling time was much faster, only about three years.
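The "four doublings" figure follows directly from the numbers in this paragraph. As a worked check, with $n$ the number of doublings and $T_2$ the doubling time (symbols introduced here only for illustration):

```latex
n_{1800} = \frac{75 - 43}{T_2} = \frac{32\ \text{years}}{8\ \text{years}} = 4,
\qquad
\text{growth factor} = 2^{n_{1800}} = 2^{4} = 16 .
```

For comparison only, a three-year doubling time applied to the same hypothetical 32-year span would give

```latex
n = \frac{32}{3} \approx 10.7,
\qquad
2^{32/3} \approx 1.6 \times 10^{3},
```

which is why a shorter doubling time translates into so much more fame at the peak. The 32-year span is borrowed from the class of 1800; the text does not give the corresponding span for the class of 1950.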
因此,尽管曲线形状相同,但年轻一代比年长一代更出名。就疾病而言,名气与肺结核截然相反。两代人的曲线看起来都一样,但年轻一代非但没有更能抵御名气,反而更容易受到影响。如今在世的名人比他们的前辈要出名得多,而且名气呈指数级增长。
As a result, although the shape of the curve is the same, the younger classes get much more famous than the older classes. As diseases go, fame is the opposite of tuberculosis. The curve looks the same for each cohort, but instead of being more resistant to fame, younger cohorts are more likely to be afflicted. The most famous people alive today are exponentially more famous than their predecessors.
为了更好地理解这些班级能有多出名,不妨将他们与我们每天接触到的东西进行比较。想想农产品货架。在巅峰时期,2-gram“Bill Clinton”的出现频率几乎与“生菜”(lettuce)一词完全相同,是“黄瓜”(cucumber)一词的两倍,大约是“番茄”(tomato)一词的一半。他完全超越了萝卜和花椰菜等二流蔬菜。至于芜菁甘蓝和苤蓝的悲惨命运,我们就不提了。
To give a sense of how famous these classes can get, it’s helpful to compare them to objects we encounter every day. Consider the produce aisle. At his peak, the 2-gram Bill Clinton was almost exactly as frequent as the word lettuce, twice as frequent as the word cucumber, and about half as frequent as the word tomato. He completely outclassed second-tier vegetables like turnip and cauliflower. We won’t even bring up the sorry fate of rutabaga and kohlrabi.
第三个参数考察的是名气在巅峰之后衰落的速度。就像放射性元素或不规则动词一样,名人的名气也有一个半衰期,即一段典型的时间,在此期间,名气往往会衰减一半。这个参数的时间尺度也越来越短。1800年,这个半衰期是120年。到1900年,半衰期已经缩短到71年。人们越来越出名,但他们被遗忘的速度也越来越快。所以,别再想那位老家伙说的那句话了:未来每个人的全球知名度都只有7.5分钟。
The third parameter examines how fast fame declines after the peak. Like a radioactive element or an irregular verb, the fame of the famous has a half-life, a characteristic period of time over which it tends to decline by half. The timescale for this parameter has also been getting shorter. In 1800, this half-life was 120 years. By 1900, the half-life had dropped to 71. People are getting more famous, but they are also forgotten faster. So forget about what ol’ whatshisname said: In the future, everyone will be world-famous for only 7.5 minutes.
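To get a feel for what these half-lives mean, here is a small sketch of exponential decay after the peak. It assumes a clean exponential with the half-lives quoted above (120 years for the class of 1800, 71 years for the class of 1900); the function names are hypothetical.

```python
import math


def fraction_remaining(years_after_peak, half_life):
    """Share of peak fame left after a given number of years."""
    return 0.5 ** (years_after_peak / half_life)


def years_to_fraction(fraction, half_life):
    """Years after the peak until fame has fallen to the given share."""
    return half_life * math.log(fraction) / math.log(0.5)


for label, half_life in (("class of 1800", 120), ("class of 1900", 71)):
    print(
        label,
        "share left a century after the peak:",
        round(fraction_remaining(100, half_life), 3),
        "years to fall to a tenth of peak fame:",
        round(years_to_fraction(0.1, half_life)),
    )
```

Under these assumptions the later class loses a noticeably larger share of its fame in the same hundred years, though either decline still plays out over centuries.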
幸运的是,超级名人无需担忧。他们应该记住这样一个故事:有个人在一次会议上听说太阳将在45亿年后消亡,长叹一口气说道:“真是松了一口气!我还以为是450万年呢。” 等到名望半衰期的缩短开始对他们产生明显影响时,超级名人早已彻底作古了。
Fortunately, the extremely famous have nothing to worry about. They should bear in mind the story of the man who, upon hearing at a conference that the sun will die out in 4.5 billion years, sighed loudly and remarked, “What a relief! I thought it was 4.5 million years.” By the time the decreasing half-life of fame begins to affect them appreciably, the extremely famous will be extremely dead.
你们当中有些人可能还很年轻,还没有做出那个重大的决定:“我长大后想做什么?” 你应该成为一名作家,用文字的力量激励读者吗?还是成为一名电影演员,在模拟的爆炸场面中用真实的情感表达赋予角色生命?你想成为一名歌手?一名舞者?一名教师?一名警察?一名政治家?还是一名摇滚明星?你想成为第一个登陆火星的宇航员,还是下一个巴勃罗·毕加索?所有这些选择都向你开放。
Some of you are probably young enough that you have yet to make the great and momentous decision, “What do I want to be when I grow up?” Should you become a writer, inspiring audiences through the power of your words? A film actor, bringing characters to life with authentic displays of emotion amid simulated displays of explosion? Should you be a singer? A dancer? A teacher? A police officer? A politician? A rock star? Do you want to be the first astronaut to land on Mars, or the next Pablo Picasso? All these options are open to you.
选择职业的一大挑战是缺乏可靠的数据,无法预知如果选择某个职业,你的生活会是什么样子。正因如此,当你问别人应该如何规划人生时,他们的建议总是那么模糊。
One of the great challenges in choosing a career is the lack of solid data, some way to know what your life would be like if you picked one option or another. That’s why, when you ask people what you should do with your life, their advice is always so vague.
但我们是数据控。其他人给你的那种“追随你的幸福”之类的松散建议根本不符合我们的风格。相反,我们会用冷冰冰的统计数据和量化数据来帮助你做出艰难的决定。
But we’re numbers guys. The loosey-goosey “Follow your bliss” type of advice that everyone else has been giving you is just not our style. Instead, we’re going to present you with cold, hard statistics, quantitative data to help you make a difficult decision.
当然,前提是,你唯一关心的事情就是变得非常非常出名。
Assuming, of course, that the only thing you care about is becoming very, very famous.
我们组建了焦点小组,成员均为出生于1800年至1920年间的名人,并按其职业进行细分。我们考察了六种可能的职业选择:演员、作家、政治家、科学家、艺术家和数学家。每种职业中,我们选取了25位最著名的人物作为焦点小组成员。如果你正在考虑成为股票经纪人、咖啡师或卡通人物,那么,很遗憾,你没那么幸运了——我们的图表空间不够。
We assembled focus groups composed of celebrities born between 1800 and 1920, broken down by their chosen occupation. We looked at six possible career choices: actor, writer, politician, scientist, artist, and mathematician. In each case, the twenty-five most famous members of that profession were included in our focus group. If you’re considering becoming a stockbroker, a barista, or a cartoon character, then, alas, you’re out of luck—we didn’t have enough room in our chart.
当然,你不仅仅想知道自己在各个行业能有多出名。如果你早已去世,或者年纪太大,无法享受名利,那么即使真正出名也毫无意义。这就好比接受一份高薪工作,但第一笔薪水要一个世纪后才能到账。为了做出明智的决定,你需要知道的是,在你的一生中,你预计能有多出名(假设一切顺利,并且如预期的那样,成为你所选择的职业中最有名望的人之一)。这就是我们为你整理的图表。
Of course, you don’t just want to know how famous you can become in each profession. Becoming really famous is no use if you’re long dead or if you’re too old to enjoy it. This would be like accepting a high-paying job in which the first paycheck won’t arrive for a century. To make an informed decision, what you want to know is how famous you can expect to be throughout your life (assuming everything goes off without a hitch and, as expected, you manage to become one of the most famous members of your chosen vocation). So that’s exactly the chart we put together for you.
这张图表将使您的决定变得更加容易。
This chart will make your decision much, much easier.
如果你真的想年轻出名,那就去做演员吧。演员往往在二十多岁或三十出头就成名,并且有一生的时间享受名声。但我们研究的演员们生活在电视等大众媒体能够推动他们事业发展的时代之前,因此他们从未像其他一些群体那样出名。
If you really want to be young and famous, be an actor. Actors tend to become famous in their late twenties or early thirties, and have a lifetime to enjoy their fame. But the actors we studied lived before mass media like television could have helped propel their careers, and never got quite as famous as some of the other groups.
如果你愿意将满足感延迟一段时间,那么成为一名作家更有意义。作家往往在三十多岁时成名,但最优秀的作家——那些创作出伟大经典作品的作家——最终会比演员更出名。这一点在书籍方面尤其明显,因为作家们喜欢写其他作家。(再次提到抽样偏差:ngram 相当于主场优势。)
If you’re willing to delay gratification for a little while longer, it makes more sense to become a writer. Writers tend to become famous in their late thirties, but the best writers—those who penned great classics—eventually become much more famous than the actors. This is particularly true when you look at books, since writers like to write about other writers. (Sampling bias again: the ngram equivalent of a home-field advantage.)
与你想象的相反,如果你真的擅长延迟满足,你或许应该成为一名政治家。政客往往要到四五十岁甚至六十岁才会声名鹊起。到了那个时候,最著名的政客们要么当选美国总统(二十五位中有十一位),要么成为其他国家的元首(另外九位),他们的名气就会迅速飙升,甚至超越其他两类人。所以,如果你已经五十多岁了,但还不是家喻户晓的人物,那么政治或许正是你的职业选择。
Contrary to what you might expect, if you’re really good at delaying gratification, you should probably become a politician. Politicians tend not to be very famous until their forties, fifties, or even sixties, at which point the most famous politicians get elected president of the United States (in eleven of the twenty-five cases) or become the head of state somewhere else (an additional nine cases), and their fame quickly soars to eclipse either of the other two groups. So if you’re in your fifties and not yet a household name, politics might just be your calling.
接下来我们来看看科学家。最著名的科学家最终的名气和演员差不多,但他们成名所花的时间要长得多:他们是在六十多岁而不是二十多岁成名的。名气更小,等待更久。出演《生活大爆炸》肯定比研究大爆炸理论要好。
Next we took a look at scientists. The most famous scientists eventually become about as famous as the actors, but they took a lot longer to get there, achieving fame in their sixties instead of their twenties. Less fame, longer wait. It’s definitely better to star on The Big Bang Theory than to study the big bang theory.
更糟糕的是给大爆炸理论或其他任何东西画画。我们名单上的艺术家们受到了不公平的待遇。他们等待成名的时间和科学家一样长,但名气却只有科学家的一半。
Still worse is drawing pictures of the big bang theory, or of anything else. The artists on our list got a raw deal. They waited just as long for fame as the scientists, but did half as well.
但如果你想出名,最糟糕的事情就是我们所做的:追求数学。
But if you want to become famous, the worst possible thing to do is what we did: to pursue mathematics.
你可能认为并非如此。毕竟,据说数学家在年轻时成就最为突出,之后他们大概就可以放松一下了。例如,卡尔·弗里德里希·高斯十九岁时就发明了模运算,证明了二次互反律,猜想出了素数定理(数学中最深奥、最基础的成果之一),并发现了关于整数分解为三角数的深刻结论。这并非他十九岁时所做的全部工作;这只是他在大约三个月的时间里所做的工作。真是爱炫耀。
You might think this is not so. After all, mathematicians are said to do their best work when they are young, after which they can presumably put their feet up and relax. For example, when he was nineteen years old, Carl Friedrich Gauss invented modular arithmetic, proved the law of quadratic reciprocity, conjectured the prime number theorem (one of the deepest and most fundamental results in all of mathematics), and discovered a profound result about the decomposition of integers into triangular numbers. This is not everything he did when he was nineteen years old; this is just what he did over a span of about three months. What a show-off.
问题是公众并不关心像年轻的卡尔·弗里德里希这样的数学家在做什么。等到我们焦点小组中的数学家们终于产生明显的名声信号时,他们中的大多数已经去世了。数学不会让你出名。证毕。
The problem is that the public does not care what mathematicians like young Carl Friedrich are doing. By the time the mathematicians in our focus group managed to generate an appreciable fame signal, most of them were dead. Math won’t make you famous. QED.
我们知道人们何时成名,成名速度有多快,又会多快被遗忘,甚至知道哪些职业选择让他们声名鹊起。但在结束关于名望和ngrams的讨论之前,我们必须提出一个简单的问题:归根结底,过去两个世纪里出生的最著名的人是谁?
We know when people get famous, how fast they get famous, how quickly they’ll be forgotten, and even which career choices lead them to fame. But it’s impossible to conclude our discussion of fame and the ngrams without asking a very simple question: When all is said and done, who are the most famous people born in the last two centuries?
为了考察最著名的人物,我们需要稍微改变一下方法。我们目前使用的策略——追踪人们全名的提及次数——对于观察一个人或一群人随时间的变化非常有效。但在比较不同的个体时,各种奇特的效应使得全名频率成为一个糟糕的选择。
To examine the most famous people, we’ll need to change our methods a little bit. The strategy that we’ve used so far—tracking mentions of people’s full names—is great for looking at one person or a group of people over time. But when comparing different individuals, there are all kinds of peculiar effects that make full-name frequency a poor choice.
例如,考虑一下以下这个毫不奇怪的事实。在提及大多数人时,作者倾向于使用该人的姓氏,而不是写出全名。如果你看到“Einstein”这个词,那么之前出现的单词是“Albert”的概率只有十分之一。
For instance, consider the following totally unsurprising fact. When referring to most people, writers tend to use that person’s last name rather than write out a full name. If you see the word Einstein, the chance that the previous word was Albert is only about one in ten.
但如果一个人的姓和名都只有一个音节,人们会更频繁地写出他们的全名。如果你看到“Twain”这个词,那么前一个词是“Mark”的可能性就大于50%。
But if a person’s first name and last name are both only one syllable long, people will write out their full name much more often. If you see the word Twain, the chance that the previous word was Mark is better than 50/50.
解决这个问题最简单的方法是不再追踪某人全名的提及,而是追踪其姓氏的提及。这样做的另一个好处是,由于上述原因,你可以捕捉到更多提及。最大的缺点是,一些极其著名的人物,例如富兰克林·德拉诺·罗斯福和泰迪·罗斯福,他们的姓氏存在歧义。他们两人在“罗斯福”的提及中都占了很大比例,因此我们无法用数据得到其中任何一人的准确数字。
The simplest way to solve this problem is to stop tracking mentions of a person’s full name and instead track mentions of their last name. An extra advantage of this is that you catch a lot more mentions, for the reasons pointed out above. The big disadvantage is that some extremely famous people, like Franklin Delano and Teddy Roosevelt, have last names that are ambiguous. Both of them account for a very large proportion of Roosevelt mentions, making it impossible to use our data to get accurate numbers for either one.
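To make the full-name problem concrete, here is a minimal sketch in Python. The counts are invented for illustration and are not taken from the ngram data; they are chosen only to match the two ratios quoted above (about one in ten for Einstein, better than 50/50 for Twain). The function name `full_name_share` is likewise hypothetical.

```python
def full_name_share(full_name_count: int, surname_count: int) -> float:
    """Fraction of surname mentions that are immediately preceded by the first name."""
    if surname_count <= 0:
        raise ValueError("surname_count must be positive")
    if not 0 <= full_name_count <= surname_count:
        raise ValueError("full_name_count must lie between 0 and surname_count")
    return full_name_count / surname_count


# Hypothetical counts (NOT real ngram data): both surnames are mentioned
# equally often, but the full name is written out at very different rates.
counts = {
    "Einstein": {"full": 1_000, "surname": 10_000},  # about one in ten
    "Twain": {"full": 6_000, "surname": 10_000},     # better than 50/50
}

shares = {
    name: full_name_share(c["full"], c["surname"]) for name, c in counts.items()
}

# Ranking by full name would make Twain look six times as prominent,
# even though the surname counts in this toy example are identical.
full_name_ratio = counts["Twain"]["full"] / counts["Einstein"]["full"]
surname_ratio = counts["Twain"]["surname"] / counts["Einstein"]["surname"]
```

In this toy example the two people are equally prominent by surname, yet a full-name comparison favors the one with the short name. That is the distortion that switching to last names is meant to remove, at the cost of the ambiguity described above.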
另一件需要注意的重要事情是,我们的方法无法区分名声和恶名。ngram 数据无法提供足够的背景信息,无法提供出现在相关名称前后的足够词汇,因此无法判断提及是正面的还是负面的。
Another important thing to note is that our approach can’t distinguish between fame and infamy. The ngram data doesn’t give us enough context, enough of the words that appear before or after the name in question, to determine if a mention is positive or negative.
唉,尽管这些问题让我们苦不堪言,我们还是得先把这些问题放在一边。在目前的阶段,像我们这样的清单只能算是一项未完成的工作——充其量也只能算是一个莱特式的风洞,当然也算不上LENS-X涡轮机。
But alas, much as they gnaw at us, we’re going to have to set those issues aside. At this stage in the game, lists like ours can be regarded only as a work in progress—at best a Wright-style wind tunnel, and certainly not a LENS-X turbine.
以下是过去两个世纪出生的十位最著名的人物的名单:
With that, here is a list of the ten most famous people born in the last two centuries:
人类历史上最邪恶的人之一阿道夫·希特勒位居榜首,这一事实令人震惊。事实上,名单上至少有三位大屠杀的凶手:希特勒,他的纳粹政权屠杀了1000万至1100万无辜平民和战俘;苏联领导人约瑟夫·斯大林,他的政权杀害了约2000万本国公民;以及贝尼托·墨索里尼,他是意大利独裁者(当时意大利是希特勒轴心国的一部分),也是埃塞俄比亚种族灭绝事件的策划者,这场灭绝事件导致三十万人丧生。
It’s impossible not to be struck by the fact that Adolf Hitler, one of the most evil men in human history, tops the list. In fact, no fewer than three mass murderers appear on the list: Hitler, whose Nazi regime murdered between ten and eleven million innocent civilians and prisoners of war; Joseph Stalin, leader of the Soviet Union, whose regime killed approximately twenty million of its own citizens; and Benito Mussolini, dictator of Italy while it formed part of Hitler’s axis, and architect of the Ethiopian genocide that led to three hundred thousand deaths.
谋杀与名声息息相关。当代美国的一个悲剧事实是,时不时会有精神失常的疯子持枪公开杀人。这种可怕现象的诸多悖论之一在于,凶手在案发前默默无闻,如今却成为媒体风暴的焦点。一方面,这种新闻报道很重要,因为人们需要了解发生了什么。但另一方面,由此产生的关注也可能成为凶手的动机。谋杀约翰·列侬的马克·戴维·查普曼在接受假释委员会的采访时就曾说过:“我这么做是为了博取关注。从某种意义上说,我就是想窃取约翰·列侬的名声,并将其加诸于自己。”
Murder and fame are linked. A tragic fact about the contemporary United States is that from time to time, deranged lunatics bearing guns engage in public killing sprees. One of the many paradoxes of this terrible phenomenon is the extent to which the killer, who was a complete unknown before the event, becomes the center of a massive media storm. On the one hand, this sort of news coverage is important, because people need to be aware of what has happened. But on the other hand, the resulting attention can become a motivation for the killers. Mark David Chapman, who murdered John Lennon, said as much when he told his parole board: “I did it for attention. To, in a sense, steal John Lennon’s fame and put it on myself.”
可悲的是,当我们以尽可能宏大的尺度审视历史记录时,类似的效应似乎也成立。我们利用 ngram 回溯历史,为过去二十个十年中的每一个十年列出了最著名的十人名单。当我们查看1940年左右的名单时,希特勒和斯大林都未上榜。但到了1950年,在犯下规模空前、残酷程度空前的暴行后,希特勒、斯大林和墨索里尼分别跃居第一、第二和第五位。相比之下,亚伯拉罕·林肯,这位或许最伟大、最具道德勇气的美国总统,排名从未高于第五位。
Tragically, when we examine the historical record at the grandest possible scale, a similar sort of effect seems to hold true. We used ngrams to go back in time, and we generated a list of the ten most famous people for each of the last twenty decades. When we examine the list circa 1940, neither Hitler nor Stalin appears. But by 1950, after carrying out atrocities of unprecedented magnitude and cruelty, Hitler, Stalin, and Mussolini jump to numbers 1, 2, and 5, respectively. In contrast, Abraham Lincoln, perhaps the greatest and most morally courageous of American presidents, never appears above number 5.
正如我们所见,用 ngram 探索名声可能引人入胜、令人困惑,甚至妙趣横生。但 ngram 中也潜藏着黑暗,而且没有比这更黑暗的秘密了:没有什么比极端邪恶的行为更能有效地造就名声了。我们生活在一个以杀人为名声最可靠途径的世界,我们有责任思考这意味着什么。
As we’ve seen, exploring fame using ngrams can be intriguing, perplexing, and even fun. But darkness, too, lurks among the ngrams, and no secret darker than this: Nothing creates fame more efficiently than acts of extreme evil. We live in a world in which the surest route to fame is killing people, and we owe it to one another to think about what that means.
一定要这样吗?在这一点上,ngram 也能给我们一些提示。因为在希特勒之前登上名人榜榜首、从 1880 年到 1940 年一直位居第一的人,并不是一个大屠杀的凶手。他是一位作家、一位社会评论家、一位“和蔼可亲、充满爱心的幽默家”,而且大多数人都认为他是个好人。“圣诞快乐!”这句祝福甚至可能就是他带给我们的。
Does it have to be this way? Here, too, the ngrams can provide us with a hint. Because the person who preceded Hitler at the top of the fame list, holding the number one spot from 1880 to 1940, was not a mass murderer. He was a writer, a social critic, a “genial and loving humorist,” and by most accounts a good man. He may even have given us “Merry Christmas!”
查尔斯·狄更斯。和平与战争。这是最好的时代,也是最坏的时代。
Charles Dickens. Peace and war. It was the best of times, it was the worst of times.
1957年,苏联发射了第一颗人造卫星斯普特尼克(Sputnik),这颗卫星激发了全世界的想象力,并预示着太空竞赛的爆发。1969年7月21日,两名美国人登陆月球并在月面上漫步,美国赢得了这场竞赛。
The USSR’s launch of the Sputnik satellite in 1957 captured the world’s imagination and heralded the space race. That race was won by the United States on July 21, 1969, when two Americans landed on the moon and went for a stroll.
更具体地说,太空竞赛的胜利者是尼尔·阿姆斯特朗,他行程23.9万英里,成为第一个在遥远星球表面行走的人类。你可能听说过他。
More specifically, the space race was won by Neil Armstrong, who traveled 239,000 miles to become the first human to walk on the surface of a distant world. You’ve probably heard of him.
你听说过另一位美国英雄巴兹·奥尔德林的可能性就小得多了。奥尔德林也登上了月球,从而实现了人类可能已经共同怀有数万年的梦想。而且,他也是在1969年7月21日做到的。但他并非第一人:奥尔德林迈出他那一小步的时间,比阿姆斯特朗晚了十九分零百分之一秒。结果,他的名气大约只有阿姆斯特朗的五分之一。
You’re much less likely to have heard of another American hero, Buzz Aldrin. Aldrin walked on the moon, too, thereby fulfilling a dream mankind has probably shared for tens of thousands of years. And he also did it on July 21, 1969. But he wasn’t first: Aldrin took his small step nineteen minutes and one one-hundredth of a second after Armstrong. As a result, he’s about five times less famous.
寓意:如果你计划做一些传奇的事情,那么请在二十分钟的咖啡休息时间之前去做。
The moral: If you’re planning to do something legendary, do it before your twenty-minute coffee break.
寂静之声
THE SOUND OF SILENCE
Dort wo man Bücher verbrennt, verbrennt man auch am Ende Menschen.
他们焚书的地方,最终也会焚人。
Where they burn books, they will, in the end, burn people.
—海因里希·海涅(1797-1856),德国犹太诗人,1933 年被纳粹列入黑名单
—Heinrich Heine (1797–1856), German Jewish poet blacklisted by the Nazis in 1933
书籍中反映的数百万种声音,讲述了关于我们文化和历史的悠久而迷人的故事。但并非所有人的声音都记录在我们的书架上。有时,那些缺失声音的沉默会淹没其他一切。
The millions of voices reflected in books tell a long and fascinating story about our culture and our history. But not everyone’s voice is recorded on our bookshelves. And sometimes the silence of the missing voices can drown out everything else.
我们的文化险些错失的声音之一,来自海伦·凯勒。她出生于1880年,年仅19个月时患上一场疾病,导致失聪失明。凯勒成长的那个时代,这样的残疾几乎使人无法接受教育。但她坚持不懈。作为第一位获得学士学位的聋盲人,凯勒最终成长为一位颇具影响力的作家、一位政治活动家,以及一位为残疾人需求大声疾呼的雄辩倡导者。在此过程中,凯勒成为数百万人的英雄,成为人类精神战胜巨大逆境的象征。
One of the people whose voices our culture nearly missed out on was Helen Keller. Born in 1880, she was left deaf and blind by an illness she contracted when she was only nineteen months old. Keller came of age in an era when such disabilities made it nearly impossible for someone to become educated. But she persevered. As the first deaf-blind person to earn a bachelor’s degree, Keller eventually grew to be an influential author, a political activist, and an eloquent advocate for the needs of the disabled. In the process, Keller became a hero to millions, a symbol of the triumph of the human spirit over profound adversity.
然而,在人类历史上最黑暗的时刻之一,凯勒不得不再次面对试图压制她的声音以及众多其他人的声音的企图。
And yet at one of the darkest moments in human history, Keller had to confront an attempt to silence her voice—and the voice of a legion of others—once again.
1933年,纳粹开始接管德国,意图控制其政府、人民乃至文化。这场运动的一个表现是查禁那些被当局认为反映“非德国精神”的书籍。在纳粹领导人的煽动下,成群的学生强行将这些书籍从图书馆和书店中拿走,并在全德国范围内掀起的焚书行动中付之一炬。海伦·凯勒就在被列入黑名单的作者之中。
In 1933, the Nazis began taking over Germany, aiming to control its government, its people, and even its culture. One manifestation of this movement was the suppression of books believed by the authorities to reflect an “un-German spirit.” Urged on by Nazi leaders, mobs of students forcibly removed such books from libraries and bookstores and set them aflame in book burnings that erupted all across Germany. Included among the blacklisted authors was Helen Keller.
凯勒的回应是一封公开信,刊登在《纽约时报》和许多其他报纸的头版,这封信曾经是、现在仍然是永恒的呐喊:
Keller’s response, an open letter published on the front page of the New York Times and many other newspapers, was and remains a timeless cri de coeur:
1933年5月9日
May 9, 1933
致德国学生:
To the student body of Germany:
如果你认为你可以扼杀思想,那么历史就什么也没教给你。暴君们过去常常试图这样做,而思想却以其威力奋起,摧毁了他们。
History has taught you nothing if you think you can kill ideas. Tyrants have tried to do that often before, and the ideas have risen up in their might and destroyed them.
你可以烧毁我的书,以及欧洲最优秀思想家的著作,但其中的思想已通过无数渠道渗透,并将继续激发其他人的思想。我把我所有书籍的版税永久地赠予了在第一次世界大战中失明的德国士兵,我心中除了对德国人民的爱与同情之外,没有任何别的想法。
You can burn my books and the books of the best minds in Europe, but the ideas in them have seeped through a million channels and will continue to quicken other minds. I gave all the royalties of my books for all time to the German soldiers blinded in the World War with no thought in my heart but love and compassion for the German people.
我承认,是种种令人痛心的复杂情势导致了你们的不宽容;正因如此,我更加痛惜你们把自身行为的耻辱传给尚未出生的后代,这既不公正,也不明智。
I acknowledge the grievous complications that have led to your intolerance; all the more do I deplore the injustice and unwisdom of passing on to unborn generations the stigma of your deeds.
别以为你们对犹太人的暴行在这里无人知晓。上帝不会睡觉,祂必将审判降临于你们。与其被所有人憎恨和鄙视,你们倒不如脖子上挂着磨石沉入海中。
Do not imagine that your barbarities to the Jews are unknown here. God sleepeth not, and He will visit His judgment upon you. Better were it for you to have a mill-stone hung around your neck and sink into the sea than to be hated and despised of all men.
海伦·凯勒
Helen Keller
凯勒慷慨激昂的论断“如果你认为你可以扼杀思想,那么历史就什么也没教给你”在全世界引起了共鸣。它引发了国际社会的轩然大波,最终促使纳粹宣传机器将焚书事件说成是非官方的“德国学生协会的自发行为”。
Keller’s impassioned argument that “history has taught you nothing if you think you can kill ideas” struck a chord the world over. It touched off an international furor, eventually leading the Nazi propaganda machine to frame the book burnings as unofficial “spontaneous acts by the German Students Association.”
尽管凯勒在世界舆论的法庭上占了上风,但她真的正确吗?扼杀一个想法真的是不可能的吗?我们对这个问题的探索将迫使我们直面人类表达的阴暗面:审查、压制和恶名昭彰。想要一窥这黑暗的现实,没有什么比最著名的窗户工匠——艺术家马克·夏加尔的人生更能展现这一点了。
Though Keller carried the day in the court of world opinion, was she actually right? Is it really impossible to kill an idea? Our quest to answer this question will force us to tackle the dark side of human expression: the world of censorship, of suppression, and of infamy. To get a glimpse of this dark reality, there are few better windows than the life of the most famous of all window-wrights, the artist Marc Chagall.
“去图书馆找一本书吧,白痴;选择任何你喜欢的图片;然后复制它。”
“Go and find a book in the library, idiot; choose any picture you like; and just copy it.”
这条来自同学的绘画建议,开启了 Móyshe Shagal 非凡的艺术生涯,把这个来自白俄罗斯维捷布斯克的鲱鱼商人之子,变成了“二十世纪最典型的犹太艺术家”马克·夏加尔。
This advice on how to draw, from a schoolmate, launched the extraordinary artistic career of Móyshe Shagal, transforming the son of a herring trader from Vitebsk, Belarus, into “the quintessential Jewish artist of the twentieth century,” Marc Chagall.
作为现代主义运动的先驱,夏加尔是二十世纪中叶最杰出的艺术家之一。他最为人所知的,首先是他的彩色玻璃窗。他的“耶路撒冷之窗”将色彩、玻璃和光线融为一体,是以色列的国家地标,甚至还印在了以色列的邮票上。夏加尔的彩色玻璃窗也装点着联合国,并照亮了欧洲各地的大教堂。“当马蒂斯去世时,”巴勃罗·毕加索曾经说过,“夏加尔将是唯一一位真正懂得色彩的画家。”
A pioneer of the modernist movement, Chagall was one of the leading artists of the mid–twentieth century. He is famous, above all, for his stained-glass windows. Fusions of color, glass, and light, his Jerusalem Windows are an Israeli national landmark—they have even appeared on the nation’s postage stamps. Chagall’s windows also grace the United Nations and illuminate cathedrals throughout Europe. “When Matisse dies,” Pablo Picasso once said, “Chagall will be the only painter left who understands what color really is.”
和上一章讨论过的许多名人一样,夏加尔年纪轻轻便声名鹊起。1917年俄国革命后,年仅三十岁的夏加尔获邀出任全俄视觉艺术委员。然而,战争和饥荒正在摧残俄罗斯人民的生活。不久,尽管夏加尔是俄罗斯最著名的年轻艺术家之一,他还是西行前往巴黎。
Like many of the famous people discussed in the previous chapter, Chagall became prominent at a young age. After the Russian Revolution of 1917, when he was only thirty years old, Chagall was offered the position of commissar for the visual arts over all of Russia. But war and famine were taking their toll on Russian life. Soon, despite being one of the nation’s most famous young artists, Chagall headed west to Paris.
1923年抵达巴黎时,夏加尔的名气并不大,他不得不努力重塑自我。他深知移民的选择对他的名望和声誉造成的影响。夏加尔在给俄罗斯收藏家兼评论家帕维尔·埃廷格(Pavel Ettinger)的信中透露了这一点:
When he arrived in Paris in 1923, Chagall was not as well known, and he had to work hard to reestablish himself. He was exquisitely aware of the impact that his choice to emigrate had made on his fame and reputation. Chagall confides as much in a letter to Pavel Ettinger, a collector and commentator back in Russia:
1924年3月10日
March 10, 1924
我担心我的“形象”正在一点一点地……消逝……这并不奇怪。我在这里待了很久,在绘画的故乡。关于我自己,我该说些什么呢?我可以说很多,但我必须简短。渐渐地,他们在法国开始注意到我了……
I’m afraid that my “image” is little by little . . . fading. . . . It is no wonder. I have been here for quite a while, in the homeland of painting. What shall I say about myself. I could say a lot, but I have to be brief. Gradually, they are beginning to notice me in France. . . .
夏加尔因为必须长话短说,便用一句“他们在法国开始注意到我了”来概括自己最近的经历,同时又担心自己在故乡的形象正在“消逝”。这份担忧是一封老友之间私密信件的核心,它带有惊人的量化色彩:人们多久会想到、谈到和写到夏加尔?
Needing to be brief, Chagall sums up his recent experience by saying that “they are beginning to notice me in France,” while at the same time expressing fear that his image is “fading” back home. This concern, the centerpiece of an intimate letter between longtime correspondents, is remarkably quantitative: How often are people thinking, talking, and writing about Chagall?
当然,夏加尔缺乏精确的方法来衡量他的名气,以及他的名气将如何发展。但至少就他的名气被书籍提及的程度而言,我们很容易考察。
Of course, Chagall lacked any precise way of measuring how famous he was and in which direction his fame was going. But at least to the extent that his fame led to mentions in books, it’s easy for us to examine.
夏加尔的评价一针见血。我们很容易就能看出他移民选择的影响,这种影响在他写给埃廷格的信时就已经相当明显了。
Chagall’s assessment was dead-on. We can readily see the effects of his choice to emigrate, which were already quite pronounced by the time of his letter to Ettinger.
但夏加尔的声望很快就受到了他无法控制的事件的影响。在莱茵河的另一岸,一支棕色军队正在集结。像夏加尔这样的先锋艺术家很快就被贴上了“非德国人”的标签。而夏加尔的处境则更加岌岌可危:他是犹太人。
But Chagall’s prominence would soon be affected by events well beyond his control. On the other bank of the Rhine, a brown army was massing. Avant-garde artists, like Chagall, would soon be dubbed “un-German.” And Chagall’s situation was even more precarious: He was a Jew.
20世纪20年代,德国是艺术的天堂。达达主义、包豪斯主义、表现主义和立体主义都曾在那里扎根。然而,阿道夫·希特勒强烈反对这些风格。他是一位品味保守的失败艺术家。此外,这些新兴运动的自由放任性与他利用文化进行社会控制的计划背道而驰。
In the 1920s, Germany was a haven for the arts. Dada, Bauhaus, Expressionism, and Cubism had all taken root there. But Adolf Hitler strongly objected to these styles. He was a failed artist with conservative tastes. Moreover, the freewheeling nature of these new movements was contrary to his plan of using culture as a form of social control.
为了给希特勒希望对德国文化施加的严厉控制提供依据,帝国在很大程度上依赖于世纪之交一位名叫马克斯·诺尔道的批评家的理论。诺尔道声称,现代文化的许多方面,例如前卫艺术,都是迄今为止未被认识到的精神疾病的产物,例如视觉皮层功能障碍。基于此,纳粹认为有必要清除德国文化中的此类影响,并给这些影响贴上“犹太”的标签,尽管诺尔道本人就是犹太人,还是一位重要的犹太复国主义领袖。1933年9月,希特勒允许帝国宣传部长约瑟夫·戈培尔创建帝国文化院(Reichskulturkammer)。他的使命是:执行希特勒净化德国文化的计划。
To justify the draconian control of German culture that Hitler hoped to exert, the Reich relied extensively on the theories of a turn-of-the-century critic named Max Nordau. Nordau claimed that many aspects of modern culture, such as avant-garde art, were a product of hitherto-unrecognized mental diseases, such as dysfunctions of the visual cortex. On this basis, the Nazis argued that it was necessary to rid German culture of such influences, which they labeled Jewish, notwithstanding the fact that Nordau himself was Jewish, and an important Zionist leader to boot. In September 1933, Hitler allowed Joseph Goebbels, Reich minister of propaganda, to create the Reichskulturkammer (Reich Culture Chamber). His mission: to carry out Hitler’s plans for purifying German culture.
在戈培尔的领导下,帝国文化院成为德国艺术界最重要的机构。戈培尔宣布:“将来,只有分院的成员才被允许在我们的文化生活中从事创作。只有满足入会条件的人才能成为成员。” 除其他要求外,成员资格还要求出示雅利安血统证明,并表明愿意认同帝国文化院的意识形态。因此,戈培尔可以放心地得出结论:“这样,所有不受欢迎和有害的因素都被排除在外了。” 纳粹并不满足于仅仅通过卡夫卡式的成员资格要求来束缚艺术家。1937 年 6 月,戈培尔任命希特勒最喜爱的画家之一阿道夫·齐格勒领导帝国文化院内的一个新委员会。该委员会的任务是从全国各地的公共和私人收藏中没收纳粹认为堕落的艺术品。
Under Goebbels, the Reichskulturkammer became by far the most important institution in German artistic life. Goebbels announced, “In the future, only those who are members of a chamber are allowed to be productive in our cultural life. Membership is open only to those who fulfill the entrance condition.” Among other things, membership required showing a certificate of Aryan ancestry and demonstrating a willingness to go along with the ideology of the Reichskulturkammer. Thus, Goebbels could safely conclude, “In this way all unwanted and damaging elements have been excluded.” The Nazis were not content merely to hamstring artists by means of Kafkaesque membership requirements. In June 1937, Goebbels appointed Adolf Ziegler, one of Hitler’s favorite painters, to head a new commission within the Reichskulturkammer. Its task was to confiscate art that the Nazis considered degenerate from collections, public and private, throughout the country.
作为一名犹太超现实主义表现主义艺术家,夏加尔首当其冲,他的作品很快就开始在德国消失。与此同时,数千件其他“堕落”的艺术作品也被没收,其中包括乔治·布拉克、保罗·高更、瓦西里·康定斯基、亨利·马蒂斯、皮特·蒙德里安和巴勃罗·毕加索等许多当今世界最著名的现代艺术家的作品。一些被没收的藏品被销毁,一些被纳粹领导人据为己有,还有一些被藏匿在阿尔陶塞盐矿等地。这对艺术界的影响难以高估。(当爱德华·蒙克的《呐喊》于 2012 年在纽约现代艺术博物馆展出时,曾拥有这幅画的一位德国犹太银行家的继承人坚持要求纽约现代艺术博物馆附上一份说明,指出他们的父亲在纳粹掌权后被迫出售了这幅画。)
As a Jewish surrealist expressionist, Chagall was right in the crosshairs, and his works soon began to disappear from Germany. At the same time, thousands of other “degenerate” pieces were taken, including works by many of the most famous modern artists in the world today—Georges Braque, Paul Gauguin, Wassily Kandinsky, Henri Matisse, Piet Mondrian, and Pablo Picasso. Some of the confiscated pieces were destroyed, some were kept by Nazi leaders, and some were hidden away in places like the Altaussee salt mine. The effect on the art world is hard to overstate. (When Edvard Munch’s The Scream was put on display at the Museum of Modern Art in New York in 2012, the heirs of a German Jewish banker who once owned the piece insisted that MoMA should include a note pointing out that their father had been forced to sell the painting after the Nazis rose to power.)
没收前卫艺术并禁止其创作者继续创作还不够。戈培尔和齐格勒不仅想消灭德国的现代艺术,还想抹黑它。为此,他们在慕尼黑同时举办了两个艺术展。一个展览重点展示获得政权认可的艺术家。另一个展览则展出齐格勒及其亲信忙于没收的作品。齐格勒在1937年的展览开幕致辞中发出邀请:“德国人民,来亲自评判吧!”
Confiscating avant-garde art and prohibiting those who produced it from making more was not enough. Goebbels and Ziegler didn’t just want to eliminate modern art in Germany, they wanted to discredit it. To this end, they set about creating two side-by-side art exhibitions in Munich. One exhibition highlighted artists who had the approval of the regime. The other featured works that Ziegler and his cronies had been busily confiscating. In his 1937 speech inaugurating the exhibits, Ziegler issued an invitation: “German Volk, come and judge for yourselves!”
第一个展览名为“伟大的德国艺术展”(Große Deutsche Kunstausstellung),是现代史上最奢华的艺术展之一。事实上,展出的不仅仅是艺术品:展览还为艺术之家(Haus der Kunst)揭幕,这是一座宏伟的新博物馆建筑,堪称纳粹建筑的典范。展览中展出了许多纳粹认可的艺术家的作品,例如阿诺·布雷克(Arno Breker),他以新古典主义风格雕刻了形体完美无瑕的裸体雕塑。
The first exhibition, called the Große Deutsche Kunstausstellung (Great German Art Exhibition), was one of the most lavish art exhibitions in modern history. In fact, it was not just art that was on exhibit: The show inaugurated the Haus der Kunst (House of Art), a monumental new museum building that was a showpiece of Nazi architecture. On display were numerous works by Nazi-approved artists, such as Arno Breker, who sculpted physically flawless nudes in the neoclassical style.
第二个展览名为“堕落艺术”(Entartete Kunst),展出了齐格勒没收的许多最著名的作品。其中包括夏加尔、康定斯基、马克斯·恩斯特、奥托·迪克斯、马克斯·贝克曼、保罗·克利和拉斯洛·莫霍利-纳吉的作品。但这些作品并未受到与“伟大的德国艺术展”(Große Deutsche Kunstausstellung)展品同等的待遇。
The second exhibition, titled Entartete Kunst (Degenerate Art), featured many of the most famous works that Ziegler had confiscated. Pieces by Chagall, Kandinsky, Max Ernst, Otto Dix, Max Beckmann, Paul Klee, and László Moholy-Nagy were on display. But the pieces were not given the same treatment as those of the Große Deutsche Kunstausstellung.
这次展览并没有在一座宏伟的新博物馆举行。相反,作品被塞进了一栋建筑二楼的一个较小的空间里,这栋建筑曾是德国考古研究所的所在地。那里只能通过狭窄的楼梯进入。展品本身拥挤不堪,悬挂简陋,而且往往没有裱框。展品上常常标注着博物馆购入的价格。由于许多展品是在20世纪20年代德国恶性通货膨胀时期购买的,因此这些数字格外离谱。
This exhibit did not take place in a monumental new museum. Instead, the works were crammed into a smaller space on the second floor of a building that had once housed the German Institute for Archaeology. It was accessible only by a narrow stairwell. The pieces themselves were crowded, poorly hung, and often unframed. Works were frequently labeled with the price a museum paid to acquire them. Because many had been bought during the period of German hyperinflation in the 1920s, the numbers were particularly outlandish.
展览总体上显得杂乱无章,只有一些区域专门展出纳粹认为贬低宗教、德国军事和家庭生活的作品。墙上贴满了涂鸦般的标语,例如“蓄意破坏国防”、“理想——白痴和妓女”、“病态思维下的自然”、“对德国女性的侮辱”以及“犹太人对荒野的渴望显露出来——在德国,黑人成为堕落艺术的种族理想”。 展出的110位艺术家中,只有6位是犹太人,他们的作品被放置在一个单独的房间,名为“犹太的,太犹太了”。然而,展览中暗流涌动的是,所有现代艺术都是“犹太布尔什维克”反对德国价值观的阴谋。
The exhibit was largely disorganized, except for sections dedicated to works that the Nazis thought demeaned religion or German military and family life. The walls were covered with graffiti-like slogans, such as “Deliberate Sabotage of National Defense,” “The Ideal—Cretin and Whore,” “Nature as Seen by Sick Minds,” “An Insult to German Womanhood,” and “The Jewish Longing for the Wilderness Reveals Itself—in Germany the Negro Becomes the Racial Ideal of a Degenerate Art.” Of the 110 artists whose works were on display, only six were Jewish, and their pieces were placed in a separate room, titled “Jewish, All Too Jewish.” Nevertheless, an undercurrent of the exhibit was that all of modern art was a “Jewish-Bolshevist” conspiracy against German values.
简而言之,“堕落艺术”展并非旨在成为通常意义上的展览。相反,它是一场由政府出资、以展览为形式的颠覆性论战。它是一件宣传品,旨在诋毁现代艺术,将其描绘成道德沦丧、粗俗商业化、浪费纳税人资金的东西。
In short, Entartete Kunst was not designed to be an exhibition in the ordinary sense of that word. It was, instead, the exhibition as subversive government-funded polemic. It was a propaganda piece whose goal was to undermine modern art, to present it as morally bankrupt, crassly commercial, and a waste of taxpayer funds.
这场展览轰动一时,仅在开幕后的头四个月就吸引了超过200万参观者,平均每天近1.7万人次。它吸引的参观人数是艺术之家展览的五倍。这样的数字在艺术展中前所未有,至今仍属罕见。
And it was a huge blockbuster, attracting more than 2 million visitors in its first four months alone, or nearly 17,000 people a day. It attracted five times as many visitors as the exhibit at the Haus der Kunst. These numbers were and remain unheard-of for an art exhibition.
为了直观地了解展览的参观人数,我们可以看看 2011 年世界上参观人数最多的艺术展:巴西银行文化中心举办的“埃舍尔的魔法世界”,该展览每天吸引 9,677 人,仅为“堕落艺术”展人流量的一半多一点。2010年,纽约现代艺术博物馆举办了一场大型展览“抽象表现主义纽约”,其主题与“堕落艺术”展有些重叠,因为展出的是该地区现代艺术家的作品。这场展览也是当年规模最大的展览之一,在七个月内吸引了 110 万人,即每天约 5,600 人,但仍然只是“堕落艺术”展的一小部分。
To give a sense of how well attended the exhibit was, consider that in 2011, the best-attended art exhibit in the world, the Centro Cultural Banco do Brasil’s Magical World of Escher, attracted 9,677 people per day, little more than half the traffic of Entartete Kunst. In 2010, New York’s Museum of Modern Art put on a major exhibition, Abstract Expressionist New York, whose subject matter overlapped somewhat with Entartete Kunst in that it was an exhibition of modern artists from the region. This exhibition, too, was one of the biggest of the year, drawing 1.1 million people over seven months, or about 5,600 people per day—still, just a fraction of Entartete Kunst.
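The per-day comparisons above are simple divisions, and a short Python sketch makes them easy to check. The text gives totals and durations in months but not exact day counts, so the 30-day month used here is an assumption. The results therefore only approximate the quoted figures; in particular the MoMA figure comes out somewhat below the quoted 5,600 per day.

```python
DAYS_PER_MONTH = 30  # assumption: the text does not give exact opening days


def visitors_per_day(total_visitors: int, months: int) -> float:
    """Average daily attendance, treating every month as DAYS_PER_MONTH days."""
    return total_visitors / (months * DAYS_PER_MONTH)


# Totals and durations as quoted in the text.
entartete_kunst = visitors_per_day(2_000_000, 4)  # "nearly 17,000 people a day"
moma_abex = visitors_per_day(1_100_000, 7)        # quoted as "about 5,600"
escher = 9_677                                    # already a per-day figure

# Escher's daily traffic as a share of Entartete Kunst's.
escher_share = escher / entartete_kunst
```

With these assumptions, `entartete_kunst` is roughly 16,700 visitors a day and `escher_share` is a little under 0.6, which agrees with "little more than half."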
展览受欢迎并非只是统计数据。庞大的观众群增强了观展体验,他们也成为了展览的一部分。一位参观者如此描述:
The fact that the exhibition was popular is not merely a statistic. The massive crowds amplified the experience, becoming a part of the display. Here’s how one visitor described it:
我感到一阵压倒性的幽闭恐惧。一大群人推推搡搡,互相嘲讽,并宣称自己不喜欢这些艺术品,这给人一种上演的印象,意在挑起攻击性和愤怒的气氛。人们一遍又一遍地大声念着购买价格,然后大笑、摇头,或者要求“退还”他们的钱。
I felt an overwhelming sense of claustrophobia. The large number of people pushing and ridiculing and proclaiming their dislike for the works of art created the impression of a staged performance intended to provoke an atmosphere of aggressiveness and anger. Over and over again, people read aloud the purchase prices and laughed, shook their heads, or demanded “their” money back.
因此,“堕落艺术”展是视觉艺术与行为艺术的混合体,它以一种低俗且误导的方式展示现代艺术作品,旨在激起公众的愤怒和蔑视,而这一切共同构成了每位参观者的观展体验。很快,这场轰动一时的展览开始在各个城市之间巡回,将其嘲讽的信息传遍整个德国。总计有5%到10%的德国人前往参观。可悲的是,“堕落艺术”展成了有史以来最受欢迎的艺术展览。
Thus, Entartete Kunst was a hybrid of visual and performance art, displaying modern artwork in a tasteless and misleading way in order to incite public anger and scorn—all of which created the individual visitor’s encounter. Soon, the smash hit began traveling from city to city, carrying its derisive message across Germany. All in all, between 5 and 10 percent of Germans paid a visit. Tragically, Entartete Kunst was the most popular art exhibition of all time.
“堕落艺术”展之后,在德国做一名现代艺术家实际上已不可能。贝克曼、恩斯特、克利和其他几位艺术家逃离了德国。留下来的人被禁止创作艺术。埃米尔·诺尔德面对这样的禁令,偷偷地继续用水彩作画,以免颜料的气味暴露他。恩斯特·路德维希·基希纳则替纳粹完成了他们已经开始的事:他自杀了。
After Entartete Kunst, it was effectively impossible to be a modern artist in Germany. Beckmann, Ernst, Klee, and several other artists fled the country. Those who remained were forbidden to create art. Emil Nolde, facing such a ban, secretly continued painting in watercolor so that the smell of paint would not give him away. Ernst Ludwig Kirchner finished the job the Nazis had begun: He committed suicide.
那么夏加尔呢?即使他的名字正在迅速从德国文化中消失,生活在法国的夏加尔最初也幸免于暴力。但1940年法国沦陷后,夏加尔意识到自己的生命受到了威胁。他和家人持伪造签证前往美国。
And what of Chagall? Even as his name was being rapidly effaced from German culture, Chagall, living in France, was initially safe from physical violence. But when France fell in 1940, Chagall realized that his life was in danger. Using forged visas, his family left for the United States.
这些根据德语出版的书籍计算得出的ngram,清晰地展现了纳粹镇压对夏加尔及其同时代艺术家的影响。1936年至1943年间,马克·夏加尔的全名在我们的德国书籍记录中只出现过一次。纳粹没能杀死夏加尔,但他们找到了抹去他的方法。
These ngrams, computed from books published in the German language, make the effects of Nazi suppression on Chagall and his contemporaries crystal clear. Between 1936 and 1943, Marc Chagall’s full name appears only once in our German book records. The Nazis did not manage to kill Chagall. But they found a way to erase him.
纳粹政权对德国文化的操控远不止现代艺术,它塑造了德国思想的方方面面。任何被纳粹政权视为不合适的概念都会成为攻击目标。在这场针对思想的运动中,书籍是不可避免的早期战场。希特勒宣誓就任总理不到十周,这场战斗就此展开。
The Nazi regime’s manipulation of German culture extended far beyond modern art, shaping every aspect of German thought. Any concept that the regime deemed unsuitable was a target. In this campaign against ideas, books were an inevitable and early battleground. Less than ten weeks after Hitler was sworn in as chancellor, the battle was joined.
纳粹的影响在德国社会根深蒂固,以至于这场斗争的开场并非直接来自政府。1933年4月,德国最主要的学生会——德国学生会(Deutsche Studentenschaft)发起了一场全国性的运动,旨在清除德国文化中的不良思想。几天之内,学生们有意识地效仿马丁·路德的做法,在德国各地张贴海报,列出了“反对非德国精神的十二条论纲”。以下是第七条论纲:
Nazi influence had so deeply penetrated German society that the opening salvo in this battle did not come directly from the government. In April 1933, the principal student union in Germany, called the Deutsche Studentenschaft, initiated a nationwide campaign to cleanse German culture of undesirable ideas. Within days, in a conscious attempt to echo Martin Luther, the students put up posters all over Germany, listing “12 Theses Against the Un-German Spirit.” Here is thesis number 7:
我们要将犹太人视为异族,我们要尊重德国人民(Volk)的传统。因此,我们要求审查员:犹太作品必须以希伯来语出版。如果以德语出版,必须注明其为译文。我们将采取最强硬的措施,反对滥用德语文字。德语文字只供德国人使用。非德意志精神应从公共图书馆中根除。
We want to regard the Jew as alien and we want to respect the traditions of the Volk [the German people]. Therefore, we demand of the censor: Jewish writings are to be published in Hebrew. If they appear in German, they must be identified as translations. Strongest actions against the abuse of German script. German script is only available to Germans. The un-German spirit is to be eradicated from public libraries.
在纳粹运动的笼罩下,德国学生会逐渐相信,德国问题的根源在于图书馆,其中就包括那些体现“非德国精神”的书籍。但学生们面临着一个难题:众所周知,很难读完图书馆里的所有书籍。他们怎么知道哪些书体现了“非德国精神”呢?
In the thrall of the Nazi movement, the Deutsche Studentenschaft had come to believe that the roots of German problems lay, among other places, in libraries, in the form of works reflecting the “un-German spirit.” But the students had a problem: As we know, it’s hard to read all the books in the library. How would they know which ones reflected the “un-German spirit”?
为此,他们找到了沃尔夫冈·赫尔曼 (Wolfgang Herrmann),一位于1931年加入纳粹党的图书管理员。赫尔曼默默无闻,经常失业,多年来一直在书架上仔细搜寻,整理出他认为会造成不良道德影响的书籍清单。赫尔曼对这份个人爱好极其执着,他为形形色色的作家(包括政治家、文学家、哲学家和历史学家)分别列出了书单。
For this, they needed Wolfgang Herrmann, a librarian who had joined the Nazi Party in 1931. Obscure and often unemployed, Herrmann had spent years combing the stacks to compile lists of books that he thought were a bad moral influence. Herrmann was extremely meticulous in this personal obsession, creating separate lists for all sorts of authors, including politicians, literary writers, philosophers, and historians.
他的努力原本没什么用,但随着希特勒掌权,赫尔曼的声望也随之上升。赫尔曼被任命为负责整顿柏林图书馆的“净化委员会”成员,他突然有机会发起一场反对他所谓的德国“文学妓院”的运动。德国学生会找到赫尔曼,请他把自己精心整理的图书清单分享给他们。赫尔曼欣然接受了。几个月之内,这位曾经默默无闻的图书管理员就拥有了一支可以调遣的军队,德国的图书馆也成了他的目标。
None of his efforts would have mattered much, except that, as Hitler rose to power, Herrmann’s profile rose, too. Named to a “purification committee” tasked with overhauling Berlin’s libraries, Herrmann was suddenly in a position to begin his own campaign against what he called Germany’s “literary bordellos.” The Deutsche Studentenschaft turned to Herrmann to ask him to share his meticulously curated lists with its campaign. These he willingly provided. Within months, the once-obscure librarian had an army at his disposal and Germany’s libraries in his sights.
1933年5月10日,最初的运动达到了高潮:大清洗(Säuberung)。学生们手持火把,拿着赫尔曼的名单,走上德国大部分大学城的街头,突袭书店、借阅图书馆和学校,焚烧了数以万计的书籍。在柏林,他们由戈培尔亲自领导,他宣称:“极端犹太知识分子主义的时代如今走到了尽头……未来的德国人不仅是读书人,更是有品格的人。”到五月底,德国各地都发生了焚书事件。盖世太保没收了五百吨书籍。被焚毁的书籍包括卡尔·马克思、F·斯科特·菲茨杰拉德、阿尔伯特·爱因斯坦、赫伯特·乔治·威尔斯、海因里希·海涅,当然还有海伦·凯勒的作品。
On May 10, 1933, the initial campaign reached its climax: the Säuberung (cleansing). Outfitted with torches and Herrmann’s lists, students took to the streets of most of Germany’s university towns, raiding bookstores, lending libraries, and schools, consigning tens of thousands of books to the flames. In Berlin, they were led by Goebbels himself, who announced that “the era of extreme Jewish intellectualism is now at an end. . . . The future German man will not just be a man of books, but a man of character.” By the end of May, there had been book burnings all over Germany. Five hundred tons of books had been confiscated by the Gestapo. The burned books included works by Karl Marx, F. Scott Fitzgerald, Albert Einstein, H. G. Wells, Heinrich Heine, and of course Helen Keller.
然而,即使是五月的焚书行动,也仅仅是纳粹对德国书籍发起的长期攻击的开始。赫尔曼不断修改他的名单,名单上的作家数量从1933年的约五百人激增至1938年的数千人,成为纳粹政权支持的不断扩大的黑名单的核心。这场持续的攻击造成了毁灭性的影响。图书管理员兼图书馆历史学家玛格丽特·斯蒂格·道尔顿估计,到1938年,纳粹工业中心埃森的公共图书馆中,纳粹政权统治前的藏书有69%已被清除。其中包括许多流传最广的书籍。在没有互联网的世界里,从公共领域清除如此多的信息将带来的影响难以想象。
Yet even the May book burnings were only the beginning of a protracted attack by the Nazis on Germany’s books. Herrmann kept revising his lists, and they swelled from about five hundred authors in 1933 to thousands of writers by 1938, becoming the core of a continually expanding blacklist supported by the regime. This sustained attack had a devastating impact. Margaret Stieg Dalton, a librarian and a historian of libraries, estimates that 69 percent of the books present prior to the regime in the public library in Essen, a Nazi industrial center, had been removed by 1938. These included many of the most widely circulated books. In a world without the Internet, the impact of removing so much information from the public sphere can hardly be imagined.
虽然很难想象纳粹所创造的世界,许多对我们今天来说最重要的思想都被从国家话语中抹去,但我们仍然可以利用 ngrams 数据,从统计数据中洞察纳粹审查运动的有效性。下表显示了赫尔曼各种黑名单上作家的名气。为了进行比较,我们还附上了一份纳粹分子名单。
Although it’s hard to envision the world that the Nazis created, in which many of the ideas that are most important to us today had been effaced from the national discourse, we can still get statistical insight into the efficacy of their censorship campaign by using ngrams. The chart below shows the fame of authors listed on Herrmann’s various blacklists. For comparison, we include a list of Nazis as well.
被列入黑名单的知识分子与纳粹政权相关人员的名声形成了鲜明对比,纳粹镇压的效力令人恐惧。
The contrast between the fame of the blacklisted intellectuals and that of people linked to the Nazi regime could not be more obvious. It renders the efficacy of Nazi suppression terrifyingly apparent.
值得一提的是,赫尔曼的宣传运动并非在所有学科领域都同样有效,这或许令人惊讶。例如,在第三帝国时期,被列入黑名单的哲学和宗教书籍作者的名气下降了四倍。政治类作家的名气下降了一半:虽然降幅小于哲学家,但仍然显著。相比之下,他列入黑名单的历史学家的影响则更为有限,降幅仅为10%左右。利用ngram,我们可以比以往更敏锐地洞察纳粹反对思想运动的轮廓。
An additional observation can be made. Perhaps surprisingly, Herrmann’s campaign was not equally effective in all disciplines. For instance, the fame of authors of philosophy and religion books included on his blacklist declined fourfold during the Third Reich. The fame of authors writing about politics declined by half: less than that of the philosophers, but still pronounced. In contrast, his blacklist of historians had a more limited effect; the decline was only about 10 percent. Using ngrams, we can perceive the contours of the Nazi campaign against ideas more keenly than ever before.
纳粹政权无疑是记录最详尽的大规模政治和文化压制案例。尽管它是一个极端的例子,却绝非唯一。大数据如同强大的探照灯,可以揭示世界各地的审查制度。其中一些比我们想象的更近在咫尺。
The Nazi regime is without a doubt the best-documented case of large-scale political and cultural suppression. But although it is an extreme example, it is hardly the only one. Like a powerful searchlight, big data can reveal instances of censorship all over the world. Some of them are closer than we might like to think.
列宁领导的俄国革命建立了苏维埃社会主义共和国联盟几年后,他患上了中风,领导能力大打折扣。权力斗争随即爆发。与列宁一起领导布尔什维克的列夫·托洛茨基原本有望接替列宁。但三位革命英雄——约瑟夫·斯大林、格里高利·季诺维也夫和列夫·加米涅夫——结成政治联盟,意图削弱托洛茨基。三驾马车的策略大获成功,导致托洛茨基在共产党第十三次代表大会上遭到正式谴责,并被三人取而代之。托洛茨基被打倒后,斯大林转而攻击他的同谋。到 1925 年,三驾马车解散,斯大林成为苏联的唯一领导人。
A few years after he guided the Russian Revolution that established the Union of Soviet Socialist Republics, Lenin suffered a stroke that compromised his ability to lead. A struggle for power immediately broke out. Leon Trotsky, who led the Bolsheviks along with Lenin, had been expected to succeed him. But three heroes of the revolution—Joseph Stalin, Grigory Zinoviev, and Lev Kamenev—formed a political alliance to undermine Trotsky. The troika’s strategy succeeded brilliantly, leading to the official denunciation of Trotsky at the XIII Conference of the Communist Party and their ascendance in his place. Once Trotsky had been neutralized, Stalin turned on his co-conspirators. By 1925, the troika had dissolved, and Stalin was the sole leader of the USSR.
但斯大林并不满足于简单的晋升。为了追求绝对权力,他开始系统性地压制每一个潜在的竞争对手,毫不留情地清除长期的敌人和新朋友。像季诺维也夫和加米涅夫这样的人物被孤立、开除出党、接受审判,并于1936年在如今被称为大清洗的事件中被处决。尽管托洛茨基已经流亡墨西哥,却在同一场审判中被缺席判处死刑。他的日子屈指可数:1940年,斯大林派刺客拉蒙·默卡德前往执行法庭的判决。革命英雄托洛茨基在墨西哥因头部被斧头砍伤而死亡。
But Stalin was not satisfied with a simple promotion. In his quest for absolute power, he began a systematic campaign to suppress every potential rival, getting rid of long-standing enemies and recent friends with equal dispatch. Figures like Zinoviev and Kamenev were isolated, expelled from the Party, put on trial, and, in 1936, executed during what is known today as the Great Purge. Already exiled in Mexico, Trotsky was nonetheless condemned to death in absentia during those same trials. His days were numbered: In 1940, Stalin sent the assassin Ramón Mercader to execute the court’s judgment. Trotsky, hero of the revolution, died in Mexico of an ax blow to the head.
然而,即使是这个故事也未能完全展现斯大林对其对手的影响。他的目标不仅仅是杀死他们。他想抹去他们所有贡献的记录,将他们从同胞的记忆中抹去,只留下他自己作为革命的核心英雄。总的来说,斯大林成功了。
Yet even this story doesn’t fully capture the impact Stalin had on his rivals. His goal was not merely to kill them. He wanted to wipe out any record of their contributions, to erase them from the memory of their countrymen, leaving himself, alone, as the central hero of the revolution. And by and large, Stalin succeeded.
在他们被处决后的近半个世纪里,托洛茨基、季诺维也夫、加米涅夫以及无数其他人的贡献都被最小化和忽视。正如ngrams所揭示的,这三个人的名气在“大清洗”之后急剧下降。无论是斯大林的去世,还是1956年尼基塔·赫鲁晓夫公开否定“大清洗”,都没能让他们恢复应有的历史地位。他们的声誉最终会得到部分恢复。但这需要几代人的时间:直到80年代末米哈伊尔·戈尔巴乔夫推行“改革”(perestroika)和“公开化”(glasnost),我们才看到ngrams的反弹。
For nearly half a century after their executions, the contributions of Trotsky, Zinoviev, and Kamenev, along with countless others, were minimized and ignored. As the ngrams reveal, the fame of all three drops precipitously after the Great Purge. Neither Stalin’s death nor the public repudiation of the Great Purge by Nikita Khrushchev in 1956 managed to restore them to their rightful place in history. Their reputations would, eventually, be partially rehabilitated. But this took generations: We don’t see the ngrams rebound until the perestroika (restructuring) and glasnost (openness) reforms ushered in by Mikhail Gorbachev in the late ’80s.
斯大林并非唯一一个惧怕这些老布尔什维克革命者及其危险思想的人。二战后的美国,对共产主义的担忧日益加剧。美国有共产党员吗?如果有,他们在哪里?他们在做什么?为了确保充分调查,众议院于1945年成立了一个特别常设委员会:众议院非美活动调查委员会。
Stalin was not the only person who feared these old Bolshevik revolutionaries and their dangerous ideas. In post–World War II America, anxiety about communism was on the rise. Were there communists in the United States? If so, where were they, and what were they up to? To ensure an adequate investigation, the House of Representatives established a special standing committee in 1945: the House Un-American Activities Committee.
由于担心电影业可能成为外国宣传的秘密来源,委员会将共产主义者对好莱坞的影响作为关注重点之一。在1947年的听证会上,委员会首先听取了友好证人的证词,这些好莱坞名人的爱国主义精神国会议员们并不质疑。其中几位,包括沃尔特·迪士尼和罗纳德·里根(当时,里根只是美国演员工会主席),谈到了共产主义对他们行业的严重威胁。很快,委员会转向了那些被怀疑与共产主义有联系的不友好证人,希望他们能透露所知道的情况并点名道姓。迫于压力,大多数人同意作证。但有十人拒绝了:阿尔瓦·贝西、赫伯特·比伯曼、莱斯特·科尔、爱德华·德米特里克、小林·拉德纳、约翰·霍华德·劳森、阿尔伯特·马尔茨、塞缪尔·奥尼茨、阿德里安·斯科特和道尔顿·特伦博。他们中的许多人在各自的行业中都取得了巨大的成功,甚至获得了奥斯卡奖。如今,他们被统称为“好莱坞十人”。
Fearing that the film industry could become a clandestine source of foreign propaganda, one focus area for the committee was the influence of communists on Hollywood. In its 1947 hearings, the committee began by listening to the testimony of friendly witnesses, Hollywood personalities whose patriotism the congressmen did not question. Several of them, including Walt Disney and Ronald Reagan (at the time, he was only president of the Screen Actors Guild), spoke of a grave communist threat to their industry. Soon the committee turned to unfriendly witnesses who were suspected of communist ties, hoping that they would reveal what they knew and that they would name names. Under pressure, most agreed to testify. But ten refused: Alvah Bessie, Herbert Biberman, Lester Cole, Edward Dmytryk, Ring Lardner, Jr., John Howard Lawson, Albert Maltz, Samuel Ornitz, Adrian Scott, and Dalton Trumbo. Many of them had been extremely successful in their trade, even winning Academy Awards. Today, they are collectively known as the Hollywood Ten.
由于拒绝作证,“好莱坞十人”被控藐视国会。更糟糕的是,48位好莱坞重要制片人(包括塞缪尔·高德温和路易斯·B·梅耶等人物)也加入进来,急于巩固他们的反共立场。制片人发表声明,宣布“好莱坞十人”中任何一人都不得为他们的制片厂工作,“直到他被宣告无罪或洗清藐视罪名并宣誓声明自己不是共产党员为止”。
On account of their refusal to testify, the Hollywood Ten were cited for contempt of Congress. Worse still, forty-eight important Hollywood producers (including such figures as Samuel Goldwyn and Louis B. Mayer) weighed in, eager to bolster their anti-communist creds. The producers issued a statement announcing that not one of the Hollywood Ten would be allowed to work for their studios “until such time as he is acquitted or has purged himself of contempt and declares under oath that he is not a communist.”
制片人用这些话建立了一个黑名单,阻止“好莱坞十人”以及后来许多其他人在美国找到工作。十多年来,“好莱坞十人”成员从未在各大电影公司制作的任何一部电影中出现过。这对他们的生活和事业造成了直接且毁灭性的影响。
With those words, the producers established a blacklist that prevented the Hollywood Ten, and later many others, from finding work in the United States. No member of the Hollywood Ten was credited in a movie produced by the major studios for over a decade. The impact on their lives and careers was immediate and devastating.
直到50年代中期参议员约瑟夫·麦卡锡下台后,众议院非美活动调查委员会的权力才开始减弱。(尽管他的目标往往与此类似,但值得注意的是,身为参议员的麦卡锡并没有参与众议院的这项行动。)前总统哈里·杜鲁门在1959年发表的言论为这一逆转画上了惊叹号:非美活动调查委员会是“当今美国最不美国的东西”。失去公众同情后,黑名单濒临崩溃。最终,在1960年,当达尔顿·特伦博被署名为片名恰如其分的电影《出埃及记》的编剧时,黑名单被打破了。好莱坞的流亡者回到了应许之地。
It was only after the downfall of Senator Joseph McCarthy in the mid-’50s that the power of the House Un-American Activities Committee began to wane. (Although his goals were often similar, it is important to note that McCarthy, a senator, did not play a role in the House initiative.) Former president Harry Truman put an exclamation point on this reversal with his remark in 1959 that the Un-American Activities Committee was the “most un-American thing in the country today.” Stripped of public sympathy, the blacklist was primed for collapse. Finally, in 1960, the blacklist was violated, when Dalton Trumbo was credited as the screenwriter for the aptly named film Exodus. Hollywood’s exiles had returned to the promised land.
我们的历史充满了压制,以至于我们很容易陷入一个又一个例子的讨论。但压制如今也在发生,或许比以往任何时候都更加严重。最好的例子之一就是北京天安门广场留下的遗产。
Our history is so full of suppression that it’s easy to get caught up talking about one example after another. But suppression is also happening today—perhaps more than ever before. One of the best examples is the legacy of Beijing’s Tiananmen Square.
近年来,天安门广场发生了两起特别臭名昭著的事件。
Two particularly notorious incidents have taken place in Tiananmen Square in recent memory.
1976年,中国执政的“四人帮”镇压了天安门广场的抗议和公众悼念活动。在受人尊敬的总理周恩来逝世的触动下,约有一万人聚集在一起。尽管广场被武力清场,但据信没有人员死亡。1976年的事件在中文的ngram记录中留下了巨大的印记,天安门的提及次数大幅飙升。
In 1976, China’s ruling Gang of Four cracked down on protests and public mourning in Tiananmen Square. Spurred by the death of venerated premier Zhou Enlai, about ten thousand people had come together. Although the square was cleared by force, no lives are believed to have been lost. The 1976 incident leaves a huge fingerprint in the Chinese ngram record, with a massive spike in mentions of Tiananmen.
但在西方眼中,更臭名昭著的事件是1989年的天安门广场屠杀。这一次,由于重要官员、改革派总书记胡耀邦的去世,学生们涌向广场悼念。公众的哀悼再次演变成抗议活动,据说参与人数多达一百万人。作为回应,政府宣布戒严,并派遣三十万军队进入首都。1989年6月4日,军队抵达广场,进行了极其暴力的镇压。死亡人数——至今仍不清楚——据信有数千人。
But the far more infamous event—in the eyes of the West—was the Tiananmen Square massacre of 1989. Prompted this time by the demise of an important official, pro-reform general secretary Hu Yaobang, mourning students took to the square. Again, the public display of sorrow became a protest, in which as many as one million people are said to have participated. In response, the government declared martial law and sent three hundred thousand troops into the capital. On June 4, 1989, the troops reached the square and cleared it in an extremely violent crackdown. The number of deaths—still unknown today—is believed to have been in the thousands.
1989年天安门大屠杀理应成为所有中国异见人士的战斗口号、一个导火索和中国文化的基石。
By all rights, the 1989 Tiananmen Square massacre ought to be the rallying cry of all Chinese dissidents, a flash point and a fixture in Chinese culture.
屠杀发生后,中国政府官员迅速采取行动,开展了一场速度惊人、效果显著的审查和信息压制运动。一年之内,中国超过10%的报纸以及众多出版社被关闭。至今,所有描述屠杀的印刷媒体都必须与政府的说法保持一致。数字媒体也受到监控,这是中国大规模互联网审查运动的一部分,该运动通常被称为“中国防火长城”。在互联网上搜索“天安门广场”的人会看到经过仔细清理的搜索结果列表。(谷歌从2006年到2010年同意参与中国的封锁,但后来终止了合作。)因此,今天中国的许多年轻人对1989年6月4日的事件知之甚少,甚至一无所知。在接受询问时,北京大学的本科生似乎甚至认不出“坦克人”的照片,这张在其他地方极具标志性的照片展现了一名天安门广场抗议者挺身站在一列中国坦克前。
After the massacre, Chinese government officials sprang into action, carrying out a campaign of censorship and information suppression that was remarkable for its speed and efficacy. Within a year, more than 10 percent of China’s newspapers had been shut down, along with numerous publishing houses. To this day, all print media describing the massacre is required to be consistent with the government’s account. Digital media is also monitored as part of China’s extensive Internet censorship campaign, often referred to as the “Great Firewall of China.” Those who search the Internet for Tiananmen Square see a carefully sanitized list of results. (From 2006 to 2010, Google agreed to participate in China’s blockade, although it has since ended its cooperation.) As a result, many young people in China today know little, if anything, about the events of June 4, 1989. When quizzed, undergraduates at Beijing University appeared not even to recognize the “tank man” image—iconic elsewhere—showing a defiant Tiananmen Square protester standing in front of a column of Chinese tanks.
在西方,1989年屠杀之后,天安门事件的提及率飙升。在中国,人们对此事的兴趣曾短暂上升——甚至难以达到1976年的水平——之后便恢复正常。
In the West, mentions of Tiananmen soar after the 1989 massacre. In China, there is a transient blip of interest—hardly approaching even 1976 levels—after which things go back to normal.
天安门屠杀是当代中国历史的核心事件之一。但中国从未讨论过此事,至少在出版物中没有。许多人甚至可能对此一无所知。这张令人心碎的图表证明了当代中国审查制度的残酷效率。
The Tiananmen Square massacre is one of the central events of contemporary Chinese history. But nobody there ever discusses it, at least not in print. Many may not even know about it. This heartbreaking chart is a testament to the brutal efficiency of censorship in contemporary China.
无论发生在哪里,审查和压制通常都会留下一个特征:某些词语和短语的突然消失。这种词汇空缺的统计特征通常非常明显,以至于我们可以通过数字——大数据——来帮助弄清楚到底是什么被压制了。
No matter where they take place, censorship and suppression often leave a characteristic mark: the sudden disappearance of particular words and phrases. The statistical signature of this lexical vacancy is often so strong that we can use numbers—big data—to help figure out what is being suppressed.
让我们回到纳粹德国,看看这是如何发生的。我们的目标是寻找那些在1933年至1945年第三帝国时期,名气出现类似夏加尔式下降的人。我们可以通过比较一个人在第三帝国时期的名气与其前后的名气来衡量下降的幅度。如果他在20世纪20年代和50年代被提及的频率是千万分之一,但在纳粹统治时期却下降到一亿分之一,那就是下降了十倍。这表明这个人可能受到了某种形式的审查或压制。另一方面,如果在纳粹统治时期,他的提及频率上升到百万分之一,增长了十倍,那么这个人在纳粹统治时期就特别出名,并且可能受益于政府的宣传。这样一来,我们可以选取任何名字,并赋予其一个压制分数,反映其下降或上升的幅度。这反过来又能帮助我们找出哪些人受到了周围社会的压制。
Let’s go back to Nazi Germany to see how this works. Our goal will be to look for people whose fame shows a Chagall-like drop during the Third Reich, 1933 to 1945. We can measure the size of this drop by comparing a person’s fame during the Third Reich to the person’s fame before and after. If his frequency of mention is one in ten million in the ’20s and the ’50s, but goes down to one in one hundred million during the Nazi regime, that’s a tenfold drop. It suggests that the person was probably being censored or suppressed in some way. On the other hand, if the frequency goes up to one in a million during the Nazi era, a tenfold increase, then the person was particularly famous during the regime, and may have been benefiting from government propaganda. In this way, we can take any name and assign it a suppression score, reflecting the magnitude of the drop or of the increase. This in turn helps us figure out who was being suppressed by the surrounding society.
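The score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' actual pipeline: the function name `suppression_score`, the ten-year comparison window on either side of the period, and the convention that a value above 1 means a drop are all hypothetical choices made for this example. The input is assumed to be a mapping from year to the relative frequency of a name.

```python
from statistics import mean


def suppression_score(freq_by_year, start=1933, end=1945, window=10):
    """Ratio of a name's mean frequency outside a period to its mean inside.

    Assumed convention (not from the source): a score of 10 means a
    tenfold drop during the period, 0.1 means a tenfold increase,
    and 1 means no change.
    """
    during = [f for y, f in freq_by_year.items() if start <= y <= end]
    outside = [
        f
        for y, f in freq_by_year.items()
        if start - window <= y < start or end < y <= end + window
    ]
    if not during or not outside:
        raise ValueError("need data both inside and outside the period")
    inside_mean = mean(during)
    if inside_mean == 0:
        # The name vanishes entirely during the period.
        return float("inf")
    return mean(outside) / inside_mean


# The worked example from the text: one in ten million before and after,
# one in one hundred million during the regime.
years_outside = list(range(1923, 1933)) + list(range(1946, 1956))
dropped = {y: 1e-7 for y in years_outside}
dropped.update({y: 1e-8 for y in range(1933, 1946)})

# The opposite case: frequency rises to one in a million during the regime.
boosted = {y: 1e-7 for y in years_outside}
boosted.update({y: 1e-6 for y in range(1933, 1946)})

drop_score = suppression_score(dropped)   # roughly 10: a tenfold drop
boost_score = suppression_score(boosted)  # roughly 0.1: a tenfold increase
```

Using a ratio rather than a difference keeps the score comparable between very famous and fairly obscure names, which is what lets a single threshold be applied across a whole list.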
我们将这个自动检测器应用于数千个二战期间在世的名人姓名,并制作了两张图表。第一张图表显示了英语的抑制分数。大多数分数接近1:没有下降,也没有上升。只有不到1%的人的分数大于5。这张图表并没有什么特别之处:英语的结果很典型,与我们在几乎所有时期、几乎所有语言中看到的结果非常相似。
We applied this automated detector to thousands of names of famous people who were alive during the Second World War, and made two charts. The first chart shows the suppression scores we get for English. Most of the scores are close to one: no drop, no increase. Fewer than 1 percent of people have a score bigger than five in either direction. This chart is nothing special: The results for English are typical, and closely resemble what we see in almost all languages during almost all periods of time.
第二张图表显示的是纳粹统治时期德语的结果。它看起来完全不同。首先,它不再以1为中心,而是稍微偏左。当时大多数人至少在某种程度上受到了政权的压制;多数人的名气都大幅下降。但这不仅仅是中心位置的移动。分布也宽得多,包含更多极端值。其中少数位于右侧,也就是我们预计会看到政府宣传受益者的地方。但大多数位于最左侧:我们名单上超过10%的人名气下降了五倍或更多。
The second chart shows the results for the German language during the Nazi regime. It looks completely different. First, it isn’t centered on one, but a little to the left. Most people were being suppressed, at least somewhat, by the regime; a majority suffer significant drops in fame. But it’s not just that the center has moved. The distribution is also much wider, containing far more extreme values. A few of these are on the right, where we expect to see the beneficiaries of government propaganda. But most are on the far left: More than 10 percent of the people on our list suffer a fivefold or more drop in fame.
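The comparison between the two charts comes down to two summary numbers: where the distribution is centered, and what share of names sits in the extreme tails. A minimal sketch follows, assuming scores are expressed as a ratio in which values above 1 indicate a drop in fame (an assumption of this example, as are the function name `summarize_scores` and the toy data, which are not real results).

```python
from statistics import median


def summarize_scores(scores, threshold=5.0):
    """Summarize a non-empty list of suppression scores.

    Assumed convention: score > 1 is a drop in fame, score < 1 is a rise.
    A name counts as extreme when it changes by at least `threshold`-fold.
    """
    n = len(scores)
    dropped = sum(1 for s in scores if s >= threshold)
    boosted = sum(1 for s in scores if s <= 1 / threshold)
    return {
        "median": median(scores),
        "share_dropped": dropped / n,
        "share_boosted": boosted / n,
    }


# Toy data for illustration only.
toy_scores = [1.0, 1.2, 0.9, 6.0, 0.1]
summary = summarize_scores(toy_scores)
```

On real data, a median near 1 with both shares under 1 percent would match the typical pattern the text describes for English, while a median above 1 and a large `share_dropped` would match the German chart.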
左边的名字包括毕加索,还有包豪斯艺术、建筑和设计运动的创始人瓦尔特·格罗皮乌斯。如果你尽可能地向左走,你会发现赫尔曼·马斯的名字,他是一位新教牧师,公开谴责纳粹,并帮助犹太人获得出境签证逃离德国。由于他的努力,德国政府将他作为个人运动的目标。我们当然不是第一个注意到马斯非凡英雄事迹的人:1964年,以色列大屠杀纪念馆亚德瓦谢姆将马斯列为国际义人之一。
The names on the left include Picasso. They include Walter Gropius, founder of the Bauhaus movement in art, architecture, and design. And if you go as far left as you can, you find the name of Hermann Maas, a Protestant minister who publicly denounced the Nazis and helped Jews obtain exit visas to flee Germany. For his efforts, the Reich made him the target of a personal campaign. We’re certainly not the first to notice Maas’ extraordinary heroism: In 1964, Yad Vashem, Israel’s Holocaust memorial, recognized Maas as one of the Righteous Among the Nations.
制作完这张图表后,我们请一位来自亚德瓦谢姆的学者,仅使用普通历史学家的工具,自行判断哪些名字会出现在曲线的哪一端。我们没有让她接触我们的数据或结果,甚至没有告诉她我们为什么要问这个问题。她从我们这里得到的只是一份名单。然而,她的答案绝大多数时候都与我们的一致。
After we made this chart, we asked a scholar from Yad Vashem to make her own personal determination, using only the tools of an ordinary historian, about which names would appear at which end of the curve. We didn’t give her access to our data or to our results, and we didn’t even tell her why we were asking. All she got from us was the list of names. Nevertheless, her answers agreed with ours the vast majority of the time.
因此,我们的统计审查检测技术得出的结果在质量上与传统历史学家使用传统方法得出的结果相似。但与传统方法不同的是,我们的分析几乎可以立即通过计算机完成。
So our statistical censorship-detection technique gives results that are qualitatively similar to those of a traditional historian using traditional methods. But unlike traditional methods, our analysis can be done almost instantaneously, by a computer.
像这样的自动化方法对我们的日常生活有着巨大的潜力。我们都希望能够识别审查、压制,甚至仅仅是普通的偏见对我们每天消费的信息的影响。如今,审查监督组织试图通过仔细阅读媒体内容来提供帮助。研究人员会针对感兴趣的领域和主题,重点突出他们发现的遗漏。但随着信息量越来越大,阅读所有内容,甚至只阅读其中的重要部分,变得越来越不可能。我们需要替代方案。大数据就是其中一种强大的替代方案。
Automated methodologies like this one hold a great deal of potential for our day-to-day lives. We all want to be able to identify the effects of censorship, suppression, and even just ordinary bias on the information that we consume every day. Today, censorship watchdog organizations try to help by carefully reading media in a region of interest and on a topic of interest and highlighting the omissions they find. But as more and more information is produced, it is becoming impossible to read everything, or even a significant slice of everything. We need alternatives. Big data is a powerful one.
有趣的是,维基百科最近开始利用这种大数据方法来检测偏见。长期以来,维基百科一直存在着对女性的歧视,这可能是因为维基百科的大多数编辑都是男性。这一讨论主要依赖于传闻证据。但新的尝试正在将统计方法和ngram数据引入到这一讨论中。目标是清晰地识别出存在问题的趋势和条目,以便纠正这些缺陷。
Interestingly, Wikipedia has recently begun to take advantage of this big data approach to bias detection. There has been a long-standing discussion of an anti-female bias in Wikipedia, perhaps due to the fact that most Wikipedia editors are male. That discussion has relied primarily on anecdotal evidence. But new efforts are bringing statistical methods and ngram data into this dialogue. The goal is to clearly identify problematic trends and articles so that the shortcomings may be addressed.
未来,此类方法将不再局限于主要由善意志愿者运营的网站。它们还将致力于维护政府的廉洁,维护人民的自由,维护思想的自由。
In the future, such methods won’t be limited to Web sites staffed by, for the most part, volunteers acting in good faith. They’ll also serve to keep governments honest, and their people, and ideas, free.
短短几年间,纳粹便在抹杀大量艺术理念方面大展身手。他们不喜欢现代艺术,所以让所有艺术作品消失,只有在“堕落艺术”(Entartete Kunst)展览中那些低劣的展览才算例外。像夏加尔这样的现代艺术家被驱逐出欧洲,被迫退休,甚至被杀害。这场艺术运动几乎在德国销声匿迹。
In only a few short years, the Nazis went a very long way toward wiping out a great many ideas. They didn’t like modern art, so they made works of art disappear, making an exception only for the demeaning presentations at Entartete Kunst. Modern artists like Chagall were driven from Europe, forced into retirement, or killed. The movement all but disappeared from Germany.
那么,我们该如何理解凯勒的“如果你认为你可以扼杀思想,那么历史就没有教会你任何东西”这一观点呢?
So what should we make of Keller’s notion that “history has taught you nothing if you think you can kill ideas”?
一方面,这些思想得以留存——我们现在正在谈论它们。但另一方面,假装事情必然如此发展未免有些轻率。希特勒输掉了战争。如果历史的走向不同,或许他反对思想的运动也会有不同的结局。
On the one hand, the ideas have survived—we’re talking about them right now. But on the other, it’s a bit facile to pretend that this is how things had to work out. Hitler lost the war. If history had turned out differently, perhaps his campaign against ideas might have turned out differently, too.
然而,任何关于审查制度的讨论,如果不触及压迫性政权所用手段的意外后果,都是不完整的。想象一下,你是一位生活在德国的年轻艺术家,尽管承受着巨大的社会压力,却依然对现代艺术保持着浓厚的兴趣。如果是这样,你可能会被“堕落艺术”(Entartete Kunst)展览所吸引,那里展出了你心目中许多偶像的作品。你可以把它想象成一个教室——非常宽敞,非常喧闹,但无论如何,它都是一个大师班。
Yet any discussion of censorship would be incomplete if it didn’t touch on the unintended consequences of the very tactics used by oppressive regimes. Imagine that you were a young artist living in Germany who, despite extraordinary social pressure, remained interested in modern art. If so, you would probably be attracted to the Entartete Kunst exhibit, where many of the works of your heroes were on display. You could imagine it as a classroom of sorts—very large and very rowdy, but a master class, nonetheless.
这确实发生了。1936年,夏洛特·萨洛蒙成功考入柏林美术学院,成为那里唯一的犹太学生。她甚至在那里得过奖,尽管后来因为“种族原因”被撤销。萨洛蒙对现代艺术非常感兴趣,而“堕落艺术”展览的到来对她来说是一个难得的机会。毕竟,纳粹政权刚刚收藏了世界上许多最重要的现代艺术作品,并将它们摆放在她家门口。更棒的是,只要她能不去理会那些嘲笑的人群,这些作品就可以连续几个月供她欣赏。
This actually happened. In 1936, Charlotte Salomon managed to get admitted to the Berlin Academy of Fine Art, where she was the only Jewish student. She even won a prize there, although it was later retracted “on racial grounds.” Salomon was very interested in modern art, and when the Entartete Kunst exhibition came to town, it was an extraordinary opportunity for her. After all, the Nazi regime had just collected many of the world’s most important works of modern art and placed them conveniently at her doorstep. Better yet, they were available to be seen for months on end—so long as she managed to ignore the jeering throngs.
萨洛蒙深受“堕落艺术”展览作品的启发,并从中受益匪浅。后来,她运用多种现代艺术技巧,创作了二十世纪最杰出的自传之一。萨洛蒙的母亲、姑姑和祖母都自杀身亡。在她以第三人称叙述的回忆录中,一个黑暗的童话故事讲述了一个名叫夏洛特的女孩的故事——她的分身经历了一个令人心碎的决定:“是自杀,还是去做一些非常不寻常的事情。”
Salomon was deeply inspired by the works in the Entartete Kunst exhibition and learned a great deal from them. She later deployed many of the techniques of modern art to create one of the most remarkable autobiographies of the twentieth century. Salomon’s mother, aunt, and grandmother had all committed suicide. In her memoir—told in the third person, a dark fairy tale about a girl named Charlotte—her doppelgänger struggles through a heartrending decision: “Whether to take her own life or undertake something wildly unusual.”
这本书揭示了她在第三帝国阴影下生活和学习艺术的艰辛历程。值得注意的是,这个故事通过769幅画作展现。在这部名为《生活?还是戏剧?》的作品的结尾,萨洛蒙回答了这个问题,她得出结论:一种极不寻常的生活总比没有生命好。然而,在纳粹政权统治下,她无法决定自己的命运:1943年,怀孕的萨洛蒙在奥斯维辛去世。
The book reveals her struggle to live and to study art in the shadow of the Third Reich. Remarkably, the tale is told through the medium of 769 paintings. By the end of the work—which she titled Life? or Theatre?—Salomon has answered the question, concluding that an extremely unusual life would be preferable to no life at all. But alas, under the Nazi regime, it was not up to her: In 1943, Salomon died, pregnant, at Auschwitz.
然而,她的作品并没有随她而去。《生活?还是戏剧?》最终被送回了她的父亲和继母手中,他们在战争期间一直躲藏在荷兰。几乎立刻,它就被认为是非凡的。它被称为“安妮·弗兰克日记的图画对应物。”
Yet her work did not die with her. Life? or Theatre? was eventually returned to her father and stepmother, who had spent the war in hiding in Holland. Almost immediately, it was recognized as extraordinary. It has been called “the pictorial counterpart of Anne Frank’s diary.”
或许现代艺术的理念并没有像凯勒所说的那样,以其强大的力量崛起,摧毁纳粹。但凯勒至少部分正确。尽管纳粹残酷地压制现代艺术——禁止、没收、嘲笑和谋杀其实践者——但现代艺术的理念是无法被扼杀的。它们确实会“通过千万个渠道渗透”,通过像萨洛蒙参观“堕落艺术”那样难以预测的途径。虽然萨洛蒙本人被杀,但她的作品最终“激发了其他人的思想”。她的证词——一位现代艺术家的证词,浸润在现代艺术大师的思想中,并以现代艺术的语言进行见证——在纳粹政权统治结束后依然存在,并在确保纳粹成为“最受憎恨和鄙视的人”方面发挥了作用。
Perhaps the ideas of modern art did not rise up in their might, as Keller suggested, to destroy the Nazis. But Keller was at least partly correct. Despite the brutal Nazi efforts to suppress modern art—prohibiting it, confiscating it, mocking it, and murdering its practitioners—the ideas of modern art could not be killed. They would indeed “seep through a million channels,” through unpredictable avenues like Salomon’s visit to Entartete Kunst. And though Salomon herself was killed, her works did eventually “quicken other minds.” Her testimony—the testimony of a modern artist, steeped in the great masters of modern art and testifying in the language of modern art—outlived the Nazi regime, and played a role in ensuring that the Nazis became the most “hated and despised of all men.”
夏加尔和萨洛蒙——老师与学生——从未见过面。但在萨洛蒙去世多年后,夏加尔有机会在一次艺术节上看到她的作品。他深受感动。夏加尔对待这些作品“如此温柔。他被它们深深打动,说道:好,它们很好。”
Chagall and Salomon—the teacher and the student—never met in person. But many years after Salomon’s death, Chagall had the opportunity to see her work at an art festival. He was deeply moved. Chagall handled the works “so tenderly. He was very touched by them and said—good, they were good.”
1944年纳粹入侵匈牙利后,他们开始屠杀该国的犹太人。每天,都有超过一万名匈牙利犹太人被火车送往奥斯维辛死亡营。为了逃生,埃雷兹的祖父、祖母、父亲和姑姑都躲了起来。然而,每天早晨,他的祖父都会从藏身之处出来祈祷,戴上一对装有希伯来圣经经文的经文匣。尽管如果被发现诵读犹太礼拜经文,他可能付出生命的代价,他仍然这样做。
After the Nazis invaded Hungary in 1944, they began killing the country’s Jews. Each day, more than ten thousand Hungarian Jews were taken by train to the Auschwitz death camp. To escape, Erez’s grandfather, grandmother, father, and aunt went into hiding. Yet every morning, his grandfather emerged from their hiding place to pray, donning a pair of tefillin containing passages from the Hebrew Scriptures. He did so despite the fact that, had he been caught reading the texts of the Jewish liturgy, he would have risked paying the ultimate price.
就在我们撰写本章之际,埃雷兹的父亲——四人中的最后一位——去世了。他留给埃雷兹一个珍贵的包裹:他父亲的经文护符盒,战争期间他每天都戴着。它们被精心保存着:这张百年羊皮纸上的每一个字母都完好无损。
As we were writing this chapter, Erez’s father—the last of the four—passed away. He left Erez a treasured parcel: his own father’s tefillin, worn each and every day of the war. They had been carefully preserved: Each letter of the century-old parchment was intact.
确实,千万个渠道。
A million channels, indeed.
就像物种一样,思想可以繁衍并流行。就像物种一样,思想也会变异。权利的概念就是一个例子。
Like species, ideas can reproduce and become popular. Like species, ideas can also mutate. One example is the notion of rights.
这一概念历史悠久,可以追溯到罗马帝国及其“公民权”(ius civitatis)的概念,即公民个体的权利。在约翰·洛克(1632–1704)等哲学家理论的推动下,基本权利的概念在十七、十八世纪开始成为许多法律体系的基石,其标志是英国的《权利法案》(1689年)、美国的《权利法案》(1789年)以及法国的《人权和公民权宣言》(同为1789年)等创举。在美国,“民权”的概念逐渐主要指黑人的权利,黑人成为这个新兴国家如何对待少数族裔的试金石。
This idea has a long history that traces as far back as the Roman Empire, with its concept of ius civitatis, the rights of the individual citizen. Energized by the theories of philosophers like John Locke (1632–1704), the concept of fundamental rights began to form the bedrock of many legal systems in the seventeenth and eighteenth centuries, through innovations like the English Bill of Rights (1689), the American Bill of Rights (1789), and the French Declaration of the Rights of Man and of the Citizen (also 1789). In the United States, the idea of civil rights came to refer primarily to the rights of blacks, who became a test case for how the new nation would handle racial minorities.
受民权运动成就的鼓舞,其他团体也纷纷加入这股道德潮流。妇女权利运动在19世纪60年代内战后首次展现出曙光,并在一个世纪后的民权运动中加速发展。近几十年来,儿童权利和动物权利的普及程度也越来越高。时至今日,两种错误仍然无法构成一种正确。但幸运的是,太多的错误才能促成一场权利运动。
Encouraged by the achievements of the civil rights movement, other groups jumped on this ethical bandwagon. Women’s rights first exhibits a signal after the Civil War in the 1860s, and picks up speed during the civil rights movement a century later. In recent decades, children’s rights and animal rights have become more widespread. Today, two wrongs still don’t make a right. But fortunately, too many wrongs do make a rights movement.
记忆的持久性
THE PERSISTENCE OF MEMORY
在继续之前,我们还想讲一讲最后一场试图清除思想的运动。
Before we move on, we want to tell you about one last movement to get rid of ideas.
这次运动与我们在上一章描述的审查制度大相径庭。它并非由政府主导。没有发生流血事件,尽管在一次著名的摊牌中,该运动的一位领导人确实用壁炉拨火棍威胁了一位异见者。而且,它并非始于德国,而是始于边境对岸的奥地利,发生于20世纪20年代。
This one differed greatly from the censorship efforts we described in the previous chapter. It was not led by a government. No blood was spilled, although, in a famous showdown, one of the principals of the movement did threaten a dissenter with a fireplace poker. And it did not begin in Germany, but across the border, in Austria, in the 1920s.
在那里,一群被称为维也纳学派的哲学家已经受够了人类语言,在他们看来,语言简直是一团糟。维也纳学派所信奉的方法通常被称为逻辑实证主义,它认为只有能够被经验验证的陈述才说得通,只有可以测量的词语才有意义。其余的只会导致“起抑制作用的偏见”,没有它们我们会过得更好。你可以想象,这让很多词语上了砧板。爱可以测量吗?你能凭经验验证某件事是正确的或道德的吗?不能,学派说。而且,由于这些词指的是无法测量的事物,它们根本不属于我们的语言。
There, a group of philosophers known as the Vienna Circle had become fed up with human language, which was, in their estimation, a dreadful mess. The approach that the Vienna Circle espoused, often referred to as logical positivism, held that the only statements that make sense are those statements that can be empirically verified, that the only words that are meaningful are those that can be measured. The rest led to “inhibiting prejudices,” and we’d be better off without ’em. As you might imagine, this put quite a lot of words on the chopping block. Is love measurable? Can you empirically verify that something is right or moral? No, said the circle, you can’t, and because these words refer to things that can’t be measured, they don’t belong in our language at all.
该组织最喜欢举的例子之一是“Volksgeist”(人民精神)这个词。“Volksgeist ”本意是指一个国家的集体意识和记忆,指这个国家是什么样的,以及它心中的想法。“Volksgeist”正是那种不精确、无法衡量的概念,让该组织感到恼火,因此该组织在1929年的宣言中强调了这个词,希望将其从语言中彻底清除。
One of the circle’s favorite examples was the word Volksgeist, “spirit of the people.” Volksgeist was supposed to refer to a nation’s collective consciousness and memory, to what the nation was like and what it had on its mind. Volksgeist was exactly the sort of imprecise, unmeasurable concept that irritated the circle, so the group highlighted the term in its 1929 manifesto, hoping to banish it from language altogether.
维也纳学派的思想倾向与其说是政治审查的问题,不如说是其对科学界限的哲学态度的问题。
The Vienna Circle’s idea-animus was less a matter of political censorship and more a matter of its philosophical attitude toward the boundaries of science.
在当时,学派或许是对的。像集体记忆这样的概念长期处于科学研究的范围之外。但有了ngram,探究集体记忆这样的概念似乎不再那么遥不可及。我们能否像测试一个人的记忆那样测量集体记忆?
At the time, the circle may have been right. Ideas like collective memory have long stood outside the purview of scientific investigation. But with ngrams at our disposal, probing a concept like collective memory seems less improbable. Can we measure collective memory, the same way that we might test the memory of a single person?
如果我们想要测量集体记忆,首先最好了解个体记忆的科学。为此,我们必须求助于另一位哲学家,十九世纪的德国人赫尔曼·艾宾浩斯。艾宾浩斯感兴趣的是心智如何运作,这个领域我们现在称之为心理学。当时,心理学是哲学的一个分支,还不是一门成熟的科学。人们倾向于对心智进行理论推测,但很少做实验。
If we were going to try to measure collective memory, it would help first to understand the science of individual memory. For this, we must turn to another philosopher, a nineteenth-century German named Hermann Ebbinghaus. Ebbinghaus was interested in how the mind works, a domain that we would now call psychology. In those days, psychology was a branch of philosophy, not yet a full-fledged science. People tended to theorize about the mind, but rarely performed experiments.
艾宾浩斯的出现早于维也纳学派,但他认同经验、测量和实证验证是人类知识的基础这一观点。他的信念并没有极端到将大多数心理学概念(无论是未经测量的还是可能无法测量的)扔进词汇堆里。相反,他认为对心智的研究需要更加实证化。为了验证这一原理,他着手做了一件在当时难以想象的事情:用纯实验的方法研究自己的记忆。
Ebbinghaus predated the Vienna Circle, but he was sympathetic to the idea that experience, measurement, and empirical confirmation were the foundations of human knowledge. He was not extreme enough in his beliefs to consign most of the concepts of psychology, unmeasured and perhaps unmeasurable, to the lexical scrap heap. Instead he thought that the study of the mind needed to become more empirical. As a proof of principle, he set out to do something that was, at the time, unthinkable: to investigate his own personal memory using purely experimental methods.
他立即面临一个与我们研究名望时类似的问题。记忆是一个模糊的概念。艾宾浩斯需要将记忆这个广阔而模糊的概念,替换为少数几个明确定义、可观察的过程,从而更加清晰地聚焦。他最终确定了两个因素:我们学习的速度和我们遗忘的速度。
He immediately faced a problem that resembled the problem we faced when studying fame. Memory was a vague concept. Ebbinghaus needed to sharpen his focus by replacing the vast, ambiguous terrain of memory with a small number of well-defined, observable processes. He settled on two: how fast we learn, and how fast we forget.
即使缩小了范围,艾宾浩斯仍然面临着重大挑战。实验极大地受益于隔离的受控环境。人类记忆并不适合这样做。我们头脑中的每一条信息都嵌入在概念网络中。我们将其与相关的事实、想法、人物、情感、地点、时间和事件联系起来。这些复杂的关系对回忆有非常显著的影响。因此,很难研究我们孤立地记住特定事实的能力。我们已经看到,通过结合在一起,不规则动词如burn/burnt、learn/learnt、spell/spelt和spill/spilt可以存活几个世纪。这些记忆效应不是例外;它们是规则。
Once he narrowed his scope, Ebbinghaus still faced significant challenges. Experiments benefit immensely from an isolated, controlled environment. Human memory doesn’t lend itself to that. Every piece of information in our mind is embedded in a network of concepts. We associate it with related facts, ideas, people, emotions, places, times, and events. These complex relationships have a very significant effect on recall. As a result, it’s very hard to study our ability to remember a particular fact in isolation. We’ve already seen how, by banding together, irregular verbs like burn/burnt, learn/learnt, spell/spelt, and spill/spilt can survive for centuries. These sorts of memory effects are not the exception; they are the rule.
为了解决这个问题,艾宾浩斯想出了一个巧妙的解决方案。他意识到大多数联想都与人们试图记忆的内容的声音或意义有关。为了尽量减少不必要的联想,他决定记忆随机的无意义材料:一套由他自己设计的、包含2300个无意义音节的合成词汇。每个音节都只是三个字母的组合,辅音-元音-辅音,比如CUV和KEF。他小心翼翼地确保每个音节听起来都不太像一个真实的单词。这个冰冷的新世界里没有LUV的空间,没有留给HUG的时间,也没有意义的容身之处。
To get around this problem, Ebbinghaus came up with an elegant solution. He realized that most associations have to do with either the sound or the meaning of what one is trying to memorize. In order to minimize unwanted associations, he decided to memorize random nonsense: a synthetic vocabulary consisting of 2,300 meaningless syllables that he had devised himself. Each syllable was just a trio of letters, consonant-vowel-consonant, like CUV and KEF. He carefully made sure that none of the syllables sounded too much like a word. This cold new world had no room for LUV, no time for a HUG, and no place for meaning.
为了衡量学习效果,艾宾浩斯会从他的词汇表中随机抽取一些无意义的音节,将这些音节串联成一个列表。然后,他会测量自己记住这些列表所需的时间,并准确无误地背诵每个音节。为了衡量遗忘效果,艾宾浩斯又增加了一个步骤。在学习完一个列表后,他会等待一段固定的时间,然后看看自己还记得多少内容。
To measure learning, Ebbinghaus would draw random nonsense syllables from his vocabulary, chaining these random syllables together into lists. He would then measure how long it took him to memorize those lists, reciting each syllable with no errors. To measure forgetting, Ebbinghaus added another step to the procedure. After learning a list, he would wait for a fixed period of time, and then see how much of the list he still remembered.
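The drawing procedure is simple enough to sketch in a few lines of Python. What follows is only an illustration under loud assumptions: the alphabet, the helper names `build_syllabary` and `draw_list`, and the two excluded syllables are ours, not Ebbinghaus's. His real vocabulary had about 2,300 syllables built from German sounds; this toy version, built from the English alphabet, comes out a little smaller.

```python
import random
from itertools import product

# Assumption: English letters stand in for Ebbinghaus's German sounds.
CONSONANTS = "BCDFGHJKLMNPQRSTVWXYZ"  # 21 letters, counting Y as a consonant
VOWELS = "AEIOU"


def build_syllabary(excluded=()):
    """Return every consonant-vowel-consonant trio not listed in `excluded`."""
    banned = set(excluded)
    trios = (c1 + v + c2 for c1, v, c2 in product(CONSONANTS, VOWELS, CONSONANTS))
    return [s for s in trios if s not in banned]


def draw_list(syllabary, length, rng):
    """Draw `length` distinct syllables at random.

    Assumption: no syllable repeats within a single list.
    """
    return rng.sample(syllabary, length)


# Hypothetical exclusions: the two word-like syllables the text jokes about.
syllabary = build_syllabary(excluded=("LUV", "HUG"))
rng = random.Random(0)  # fixed seed so the sketch is repeatable
practice_list = draw_list(syllabary, 12, rng)
print(len(syllabary), practice_list)
```

A real replication would need a far more careful filter for word-like syllables; the point here is only that the vocabulary is synthetic and the lists are drawn at random.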
日复一日地记忆长串随机音节,这样的前景大概对许多潜在的受试者都没什么吸引力,但艾宾浩斯对一位志愿者确实有着超乎寻常的影响力:他自己。于是,艾宾浩斯在1878年开始研究记忆,并以自己作为唯一的受试者。
The prospect of memorizing long strings of random syllables day after day probably didn’t appeal to many potential test subjects, but Ebbinghaus did have disproportionate influence on one volunteer: himself. So in 1878, Ebbinghaus began to study memory, using himself as the only test subject.
两年多来,他坚持着极其严格的学习计划,每天花大量时间记忆随机的无意义音节。他学习了一份又一份音节表,采用高度规范的体系,按照机械表滴答声所规定的恒定节奏重复这些音节。他系统地探索了各种变量的组合——音节表的长度、一天中的时间、记忆的时间、特定音节在表中的位置、重复的时间间隔等等。艾宾浩斯是心理学史上最专注的研究者之一。
For more than two years, he stuck to a painfully strict schedule, dedicating long stretches of time each day to memorizing random nonsense syllables. He learned list after list, using a highly regimented system, repeating the syllables at a constant rhythm dictated by the ticking of a mechanical watch. He systematically explored many combinations of variables—the length of the list, the time of day, the amount of time he spent memorizing, the position of particular syllables in the list, the time interval between repetitions, and so on. Ebbinghaus was one of the most dedicated researchers in the annals of psychology.
而大自然也因此给了他一系列惊人的发现。例如,艾宾浩斯发现,随着单词表变长,即使只增加一个音节,对学习时间的影响也可能异常巨大。记忆项目数量和时间之间的这种关系今天被称为学习曲线,当人们谈论“陡峭的学习曲线”时,他们指的——无论他们是否意识到——都是艾宾浩斯。艾宾浩斯还在遗忘方面有了重要发现。他注意到,仅仅二十分钟后,他通常就会忘记单词表的近一半。但遗忘速度似乎会减慢;即使一个月后,他仍然可以记住大约五分之一的单词表。艾宾浩斯发现的遗忘与时间之间的关系被称为“遗忘曲线”。
And nature rewarded him for it with an ensemble of spectacular discoveries. For instance, Ebbinghaus discovered that, as the lists got long, the impact on learning time of even a single additional syllable could be disproportionately large. This relationship, between the number of items memorized and time, is today called the learning curve, and when people talk about a “steep learning curve,” they’re referring—whether they know it or not—to Ebbinghaus. Ebbinghaus also made important discoveries about forgetting. He noticed that after only twenty minutes, he typically forgot nearly half of the list. But forgetting seemed to slow down; even a month later, he could still remember about a fifth of the list. The relationship Ebbinghaus discovered between forgetting and time is called the “forgetting curve.”
总的来说,学习曲线、遗忘曲线以及用于发现它们的程序为现代人类记忆的科学研究奠定了基础。无意义音节表是一项非常有效的创新,至今仍是心理语言学的核心方法。事实上,艾宾浩斯的研究为整个现代心理学奠定了基础。当然,他对这项研究本身的奉献精神也非同凡响。心理学之父威廉·詹姆斯后来评价艾宾浩斯的非凡奉献精神,称赞他“追求真实平均值的英雄主义”。詹姆斯还称记忆研究是“实验心理学史上最杰出的研究”。
Taken together, the learning curve, the forgetting curve, and the procedures used to discover them laid the groundwork for the modern scientific study of human memory. The notion of a nonsense syllabary was such an effective innovation that it remains a central method in psycholinguistics to this day. Indeed, Ebbinghaus’ work was a foundational moment for modern psychology as a whole. And, of course, his personal dedication to the study itself was extraordinary. William James, a founding father of psychology, later remarked on Ebbinghaus’ extraordinary dedication, lauding him for his “heroism in the pursuit of true averages.” James also called the memory study “the single most brilliant investigation in the history of experimental psychology.”
起初,集体记忆似乎难以探究,但艾宾浩斯的故事给了我们乐观的理由。他所衡量的事物——学习和遗忘——在人类文化中有着密切的对应,这在ngrams中非常明显。
At first, collective memory seemed like a hard thing to probe, but Ebbinghaus’ story gave us cause for optimism. The things he had managed to measure—learning and forgetting—have close analogues in human culture, which are very apparent in the ngrams.
有些事情难以忘怀。两架飞机撞上纽约世贸中心十多年后,那一天的记忆仍然萦绕在美国人民的心头。十年后,《纽约客》特约撰稿人乔恩·李·安德森回忆起他的经历:
Some things are hard to forget. More than a decade after two planes barreled into New York’s World Trade Center, the memory of that day still haunts Americans. Ten years later, Jon Lee Anderson, a staff writer at the New Yorker, recalled his experience:
恐惧感迅速蔓延,我目睹了第二架飞机坠毁,意识到这是一次恐怖袭击。当建筑物倒塌时,我意识到这次袭击堪比第二次珍珠港事件。我知道我的国家很快就要陷入战争了。
With a sense of rapidly growing horror, I saw the second plane hit and realized that it was a terrorist attack and, when the buildings collapsed, that the attack was akin to a second Pearl Harbor. I knew that my country would soon be at war.
这种比较并不少见,而且理所应当。在9/11那个早晨之前大约60年,美国人一觉醒来,面对的是几十年来本土遭受的首次袭击。1941年12月7日上午,数百架日军飞机蜂拥扑向夏威夷的珍珠港基地,投下炸弹和鱼雷,留下浓烟、大火和死亡。在一个多小时的时间里,日军摧毁了大量飞机和舰船,重创了太平洋舰队。这次袭击造成2400多名美国人死亡,1000多人受伤。这则令人震惊的消息改变了历史进程,把美国从旁观者的位置推入了第二次世界大战。
This is not an infrequent comparison, and rightly so. Roughly sixty years before the morning of 9/11, Americans woke up to the first attack on their home soil in decades. On the morning of December 7, 1941, hundreds of Japanese planes swarmed the Hawaiian base of Pearl Harbor, dropping bombs and torpedoes, leaving smoke and fire and death in their wake. In little more than an hour, the Japanese destroyed numerous airplanes and ships, crippling the Pacific Fleet. The attack left more than 2,400 Americans dead and more than 1,000 wounded. The shocking news changed the course of history, thrusting the United States off the sidelines and into World War II.
尽管珍珠港事件在当时意义重大,但近一个世纪过去了,这场袭击已不再频繁出现在日常对话中。现在或许难以想象,但9·11事件也正走在同一条道路上。
But important though it was at the time, the better part of a century has passed since Pearl Harbor, and the attack no longer figures frequently in daily conversation. It may be hard to imagine right now, but 9/11 is on the same course.
这是怎么发生的?我们的集体记忆怎么会抹去哪怕最痛苦的事件?
How does that happen? How does our collective memory wipe out even the most painful events?
为了探究这一点,我们面临一个类似艾宾浩斯的问题:遗忘是如此特殊,如此依赖于我们将哪些想法与其他想法联系起来,因此很难进行良好的实验。
To probe this, we face an Ebbinghaus-like problem: Forgetting is so idiosyncratic, so dependent on which ideas we associate with which other ideas, that it’s hard to do a good experiment.
以卢西塔尼亚号远洋客轮沉没事件为例,它导致了美国加入第一次世界大战。在这场悲剧发生后的几十年里,正如我们预料的那样,它开始被遗忘,但在第二次世界大战爆发前,它又短暂地恢复了人们的记忆,这可能是因为人们对第二次世界大战的担忧将围绕第一次世界大战的事件重新推到了风口浪尖。这种联想记忆效应是一个重大问题:它既无法解释,也无法预测。
Consider the sinking of the ocean liner Lusitania, which brought about America’s entry into World War I. In the decades that follow the tragedy, it starts to be forgotten, much as we might expect, but it recovers, briefly, ahead of World War II, likely because concerns about a second world war brought the events surrounding the first one back to the fore. This sort of memory-by-association effect is a major problem: It’s impossible to account for and impossible to predict.
一个同样棘手的问题是,随着时间的推移,联想的变化会导致人们以不同的方式、使用不同的词语记住同一事件。世界大战就是一个很好的例子。第一次世界大战最初被称为大战,因为它是当时西方文明史上最惨重的战争。但随着20 世纪 30 年代末第二次世界大战爆发, “大战”一词很快就消失了,取而代之的是“第一次世界大战”。至关重要的是,人们并没有停止思考大战。那些事件仍然深深地留在集体记忆中。只是人们在两次冲突的大背景下对战争的看法不同了,所以他们使用了不同的语言。这种影响是无法解释和预测的。
An equally tricky problem is that, over time, changing associations cause people to remember the same events in different ways, using different words. Again, the world wars furnish an excellent example. World War I was originally called the Great War, as it was the deadliest war in the history of Western civilization up to that point. But as World War II began to erupt at the end of the ’30s, the term the Great War quickly disappeared, replaced by the term World War I. Crucially, it’s not that people stopped thinking about the Great War. Those events were still deeply lodged in the collective memory. But people thought about the war differently, in the broader context of both conflicts, so they used different language. Again, this sort of effect is impossible to account for and impossible to predict.
如果我们要测量遗忘,我们需要模仿艾宾浩斯,通过使用精心选择的词汇来最大限度地减少所有这些联想的影响。
If we’re going to measure forgetting, we’ll need to emulate Ebbinghaus, minimizing the effects of all these associations by using a carefully chosen vocabulary.
为了做到这一点,我们决定尝试仅使用与年份对应的数字(例如1816和1952)来探究集体记忆。通过观察人们谈论某一年份的频率,我们可以了解该年份事件在他们脑海中的呈现程度。没有哪个年份会处于特别不利的地位,也没有哪个年份与其他年份的关联过于紧密,以至于会干扰这种粗略的方法。
In order to do just that, we decided to try to probe collective memory using only numerals that correspond to years, like 1816 and 1952. By seeing how often people talk about a year, we can get a sense of how present the events of that year are in their minds. No year is at a particular disadvantage, and no year is so strongly associated with any other year that it interferes too much with this crude approach.
但是,你可能会问,等等。如果这个数字的来源是“请给我1876只半壳牡蛎和一杯Picpoul葡萄酒”呢?在这种情况下,这个数字指的是订购的牡蛎数量。
But wait, you say. What if the sentence that the number came from was “1876 oysters on the half-shell and a glass of Picpoul, please”? In that case, the number is a reference to the number of oysters being ordered.
事实证明,这并不是什么大问题。首先,点1876只牡蛎会非常奇怪,尤其是只配一杯葡萄酒的时候。但更重要的是,订购、索要或记录1876个任何东西都很奇怪。1876这个数字出现的频率极低,除非人们指的是1876年。即使是书名(如乔治·奥威尔的《1984》)和电影名(如斯坦利·库布里克的《2001:太空漫游》),对各自数字的总计数的贡献也微不足道。
It turns out that this is not a significant problem. First, it would be very strange to order 1876 oysters, especially with only one glass of wine. But more important, it’s very strange to order, request, or record 1876 of anything. The number 1876 comes up incredibly infrequently—except when people are referring to the year 1876. Even titles of books, like George Orwell’s 1984, and movies, like Stanley Kubrick’s 2001: A Space Odyssey, make a negligible contribution to the overall totals for their respective numerals.
1800年至2000年间的201个数字,在集体遗忘研究中,可以发挥艾宾浩斯合成词汇在个体记忆研究中的作用。这些数字教会了我们什么?
The 201 numbers between 1800 and 2000 can play the role, in the study of collective forgetting, that the synthetic vocabulary of Ebbinghaus played in the study of individual memory. What do these numbers teach us?
让我们给你讲一下1950年的故事。
Let us tell you the story of the year 1950.
在人类历史的大部分时间里,没有人关心 1950 年。1700 年没有人关心它,1800 年没有人考虑它,1900 年也没有人关心它。这种冷漠一直持续到 20 世纪 20 年代、30 年代,甚至 40 年代。
For most of human history, no one gave a damn about 1950. No one cared about it in 1700, no one thought about it in 1800, no one was concerned about it in 1900. This apathy persisted through the ’20s and ’30s and into the ’40s.
但从 40 年代初开始,出现了一些嗡嗡声:人们意识到 1950 年即将发生,而且可能会发生重大事件。
But starting in the early ’40s, there was a bit of a buzz: People realized that 1950 was going to happen, and that it could be big.
然而,没有什么能像 1950 年本身那样引起人们对 1950 年的兴趣。
Still, nothing got people interested in 1950 like the year 1950 itself.
突然间,每个人都对 1950 年着迷。他们不停地谈论他们在 1950 年所做的一切事情、他们计划在 1950 年做的事以及他们希望在 1950 年实现的所有梦想。
Suddenly, everyone was obsessed with 1950. They couldn’t stop talking about all the things they did in 1950, all the things they were planning to do in 1950, all the dreams they hoped might come to pass in 1950.
事实上,1950年是如此令人着迷,以至于此后的几年里,人们都觉得有必要总结一下。他们不停地谈论1950年发生的所有奇妙的事情,从1951年到1952年,再到1953年。终于,在1954年,有人——很可能是一位非常注重时尚的人——醒悟过来,意识到1950年已经有点过时了。
In fact, 1950 was so fascinating that for several years thereafter, people felt the need to debrief. They just kept talking about all the amazing things that had happened in 1950, all through ’51, ’52, and ’53. Finally, in 1954, someone—probably someone very fashion-conscious—woke up and realized that 1950 had become slightly passé.
就这样,泡沫破裂了。
And just like that, the bubble burst.
1950年的故事虽然悲惨,却绝非个例。1950年的历史,就是我们有记录的每一个年份的故事:男孩遇见X年,男孩爱上X年,男孩为了更新的一款离开了X年,随着时间的推移,男孩对他的X回忆得越来越少。
Though tragic, 1950’s story is far from unique. The history of 1950 is the story of every year that we have on record: Boy meets year X, boy falls in love with year X, boy leaves year X for a newer model, boy reminisces about his X less and less over time.
我们可以制作这些便捷的图表,逐年展示同样的过程。我们刚才描述的爱与失落的故事在每张图表中都清晰可见,但这并不奇怪。这些图表的其他特点则更出乎意料。
We can make these handy charts showing this same process for every year. The tale of love and loss that we just described is evident on each and every chart, but that’s no surprise. Other features of these charts are more unexpected.
其中一个特征就是这些遗忘曲线的整体形状。遗忘过程似乎由两种模式组成:对某一特定年份的兴趣在最初几十年迅速下降,之后下降得慢得多。集体记忆和个人记忆之间有着惊人的相似之处:社会既有短期记忆,也有长期记忆。
One such feature is the overall shape of these forgetting curves. The forgetting process seems to be composed of two regimes: Interest in a given year drops quickly in the first few decades and much more slowly thereafter. It’s a striking similarity between collective and individual recall: Society has both a short-term and a long-term memory.
我们也可以问一些非常量化的问题。比如,让我们思考一下社会的短期记忆。我们可能会想:泡沫破裂的速度有多快?一年结束后,人们会多快失去对它的兴趣?
We can also ask very quantitative questions. For instance, let’s consider society’s short-term memory. We might wonder: How fast does the bubble burst? How quickly do people lose interest in a year once it has ended?
解决这个问题的一个简单方法是,看某个年份的出现频率需要多长时间才会下降到其峰值的一半:这就是集体遗忘的半衰期。这个值因年份不同而有很大差异。1872的频率在1896年下降到峰值的一半,滞后了24年。相比之下,1973的频率到1983年就降到了峰值的一半,只用了10年。
A simple approach to that question is to see how long it takes for the frequency of a year to decline to half of its peak value: the half-life of collective forgetting. This value varies substantially from year to year. The frequency of 1872 declined to half its peak value in 1896, a lag of twenty-four years. In contrast, 1973 dropped to half its peak value by 1983, after only a decade.
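To make the half-life idea concrete, here is a minimal sketch in Python. It rests on stated assumptions: the function name `half_life_lag` is hypothetical, the frequencies are invented toy numbers rather than real ngram data, and the lag is counted from the named year itself, the same arithmetic as 1896 minus 1872 giving twenty-four.

```python
from typing import Mapping, Optional


def half_life_lag(freq_by_year: Mapping[int, float], named_year: int) -> Optional[int]:
    """Years from `named_year` until its frequency first falls to half of its peak.

    `freq_by_year` maps a publication year to the relative frequency of the
    numeral `named_year` in books from that year. Returns None when the
    series never drops to half of its peak after the peak.
    """
    years = sorted(freq_by_year)
    peak_year = max(years, key=lambda y: freq_by_year[y])
    half_peak = freq_by_year[peak_year] / 2.0
    for y in years:
        # Only look after the peak; the climb toward the peak is not forgetting.
        if y > peak_year and freq_by_year[y] <= half_peak:
            return y - named_year
    return None


# Invented toy frequencies for the numeral "1950" (NOT real ngram data).
toy_1950 = {1948: 1.0, 1949: 3.0, 1950: 10.0, 1951: 9.0, 1952: 8.0,
            1953: 7.0, 1954: 6.0, 1955: 5.0, 1956: 4.0}
print(half_life_lag(toy_1950, 1950))  # prints 5
```

In the toy series the peak is 10.0 in 1950, so the half-peak threshold is 5.0, first reached in 1955. Real ngram curves are noisier, so a careful analysis would probably smooth the series before looking for the crossing.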
1973年事件的快速衰落反映出一种普遍现象:随着时间的推移,集体遗忘的半衰期越来越短。这一观察表明,我们社会对过去的态度正在发生变化。我们对过去事件的兴趣正在越来越快地消退。
The speedier decline of 1973 is a symptom of a general phenomenon: As time passes, the half-life of collective forgetting gets shorter and shorter. What this observation suggests is that our society’s attitude toward the past is changing. We are losing interest in past events faster and faster.
是什么导致了这种变化?我们不得而知。目前,我们所掌握的只是一些赤裸裸的关联:当我们透过新视野的数字视角审视集体记忆时,所发现的只是这些关联。我们可能还需要一段时间才能弄清楚背后的机制。
What caused that change? We don’t know. For now, all we have are the naked correlations: what we uncover when we look at collective memory through the digital lens of our new scope. It may be some time before we figure out the underlying mechanisms.
这是科学的前沿。这里没有地图,充满猜测,到处都是死胡同,但这里却是最佳去处。
This is the frontier of science. There is no map, lots of guesswork, and plenty of blind alleys, but there’s no place better.
当然,我们的集体意识不仅仅是遗忘。如果我们想要理解集体记忆,我们还需要探究事物的另一面。新的信息是如何进入社会的?
Of course, our collective consciousness does more than just forget. If we’re to understand collective memory, we also need to probe the other side of the coin. How does new information enter a society?
我们视当今时代为信息时代,一个以信息在人与人之间或地与地之间传递的惊人速度为标志的时代。但我们却忽略了过去几个世纪原始信息的传播速度,当时使用的机制的潜力我们如今已无法充分认识到。例如,在十七、十八世纪的伦敦,我们现在所说的蜗牛邮件每天可以送达十五次。早上寄出的信件四小时内就能到达。它当然不如今天的电子邮件快,但也不像今天的蜗牛邮件那么慢。(到了十九世纪,伦敦人可以通过现已废弃的加压管道网络,以高达每小时25英里的速度在城市内运送包裹。)几个世纪以来,人类一直有办法确保重大新闻的快速传播。
We think of our current era as the information age, a period marked by the sensational speed with which information can be passed from one person or place to another. But we lose sight of how quickly raw information could travel in centuries past, using mechanisms whose potential we no longer fully appreciate. In seventeenth- and eighteenth-century London, for example, what we now call snail mail used to arrive as often as fifteen times a day. Letters mailed in the morning would arrive within four hours. It’s not as quick as today’s e-mail, to be sure, but not as slow as today’s snail mail, either. (By the nineteenth century, Londoners could ship parcels around the city, at speeds of up to twenty-five miles per hour, via a now-abandoned network of pressurized tubes.) For centuries, humans have had ways to ensure that big news travels fast.
书籍并非其中一种方式。书籍是传播信息的重要途径,但大多数书籍的规模相对较大,需要数年时间才能完成写作和出版。对于突发新闻来说,它们的传播速度太慢了。
Books are not one of those ways. Books are an important way to get information out, but most books are relatively large undertakings that take years to write and publish. They are much too slow for breaking news.
通常情况下,这不是问题。因为集体遗忘——至少是对于最重要的事情——相对较慢,所以使用从书籍中衍生的 ngram 可以轻松绘制数年、数十年和数个世纪以来的进步。
Often, that’s not a problem. Because collective forgetting—at least, of the most important things—is relatively slow, its progress over years, decades, and centuries is easy to chart using book-derived ngrams.
但许多进入集体意识的事物来得很快,几天、几周、几个月,最多几年就够了。1872年的ngram只用了一年时间就从默默无闻变成了人气巅峰。珍珠港事件也只用了一天。问题是,当我们试图衡量如此快速的过程时,书籍ngram的作用并不大。拍摄快球需要高速快门。
But many of the things entering the collective consciousness enter quickly, in days, weeks, months, or at most a handful of years. The ngram 1872 only took a single year to make the transition from near obscurity to peak popularity. Pearl Harbor took but a day. Trouble is, book ngrams just aren’t very useful when we’re trying to measure such fast processes. You need a high-speed shutter to take a picture of a fastball.
如果我们要使用 ngram 来学习有关学习的知识,我们需要关注一些比大新闻发展得更慢的东西。
If we are going to use our ngrams to learn about learning, we need to look at something that moves more slowly than big news.
埃雷兹的妻子阿维娃开始探索一种看起来特别有前景的集体学习研究途径:研究发明。成功的发明正是集体学习的缩影。它们反映了社会创造关于世界的新知识,并吸收这些科学和工程进步以克服相关日常挑战的能力。正因如此,发明的传播所需的时间比普通新闻要长得多。
Erez’s wife, Aviva, began exploring one approach to collective learning that seemed particularly promising: the study of inventions. Successful inventions are the very epitome of collective learning. They reflect society’s ability to generate new knowledge about the world and to assimilate these advances in science and engineering to overcome relevant day-to-day challenges. For those very reasons, inventions take much longer to spread than ordinary news.
关键的区别在于,一项发明并非仅仅是通过电子邮件或小马轻松传播的纯粹信息。创造发明具体化的工程技术、运用发明的技术技能、促进其销售和分销的经济模式,以及帮助传播发明的基础设施,所有这些都是社会全面接受一项新技术理念的必要条件。与新闻事件的口口相传不同,一项发明的新闻可能需要数十年才能传播开来。
The crucial difference is that an invention is not just pure information that can be easily communicated via e-mail or pony. The engineering know-how to create an embodiment of the invention, the technical skill to use it, the economic model to motivate its sale and distribution, and the infrastructure to help spread it all are necessary for a society to fully embrace a new technological idea. Unlike word of a newsworthy event, it can take decades for news of an invention to propagate.
使用 ngram 应该很容易探索这些较长的时间尺度。一个很好的例子是传真机。
These lengthy timescales should be easy to explore using the ngrams. A great example is the fax machine.
传真机在20世纪80年代几乎是瞬间出现的,并立刻飙升至人气顶峰。它看起来就像突发新闻。根据这条ngram,你猜传真机是什么时候发明的?
The fax machine pops up, almost instantaneously, in the 1980s, soaring immediately to peak popularity. It looks like breaking news. Judging by this ngram, when would you guess that the fax machine was invented?
80年代吧?不对。70年代?不对。60年代?50年代?40年代?
The ’80s, right? Nope. The ’70s? Nope. ’60s? ’50s? ’40s?
你猜对了:传真机发明于40年代。但不是20世纪40年代。传真机的第一项专利于1843年授予苏格兰发明家亚历山大·贝恩。到1865年,巴黎和里昂之间已经建立了当时被称为“电传”(telefax)的商业服务,这比电话的发明还要早。20世纪80年代的尖端技术之一,曾得到法国皇帝拿破仑三世的部分支持。重大新闻传播得很快,但伟大的想法却不然。
You got it: The fax machine was invented in the ’40s. But not the 1940s. The first patent for the fax machine was awarded to Scottish inventor Alexander Bain in 1843. By 1865, a commercial service for what was then called a telefax had been established between Paris and Lyon—before the invention of the telephone. One of the cutting-edge technologies of the 1980s was partly backed by Napoleon III, emperor of France. Big news travels fast—but big ideas don’t.
如果我们想了解发明需要多长时间才能传播开来,我们需要从一长串技术开始,并找出每一项技术的发明时间。
If we want to examine how long inventions take to spread, we need to start with a long list of technologies and figure out when each of them was invented.
你可能会认为这很容易做到。几个世纪以来,各国政府一直在为新发明授予专利,赋予发明者从其创造中获利的专有权。正如唯一一位拥有专利的美国总统亚伯拉罕·林肯所说:“专利制度为天才之火增添了利益的燃料。” 专利法鼓励发明者尽快披露他们的新技术。因此,我们只需找到已颁发的专利,并查看日期,就能确定某项发明的发明时间。
You would think that this is an easy thing to do. Governments have been awarding patents on new inventions for centuries, giving inventors the exclusive right to profit from their creations. As Abraham Lincoln—the only U.S. president to hold a patent—put it, “The patent system added the fuel of interest to the fire of genius.” Patent laws encourage inventors to disclose their new technologies as soon as possible. So all we need to do to figure out when something was invented is to find the patent that was issued, and check the date.
但这也是说起来容易做起来难。
But this, too, is easier said than done.
以电话为例。在美国,电话的发明者是亚历山大·格雷厄姆·贝尔。1876 年 3 月 10 日,贝尔在笔记本中写下了以下内容:
Consider the telephone. In the United States, the invention of the telephone is credited to Alexander Graham Bell. On March 10, 1876, Bell wrote the following entry in his notebook:
然后我对着M(话筒)喊道:“沃森先生——过来——我想见你。”令我高兴的是,他来了,并表示他听到并明白了我的话。
I then shouted into M [the mouthpiece] the following sentence: “Mr. Watson—come here—I want to see you.” To my delight he came and declared that he had heard and understood what I said.
贝尔后来将这项技术商业化,创立了一系列公司,这些公司旗下的各种分支至今仍主导着电信行业。对美国人来说,贝尔是一位科技英雄,他奠定了我们当今信息时代的诸多基础。
Bell later commercialized this technology, creating a series of companies whose various offshoots and offspring still dominate the telecommunications industry. To Americans, Bell is a technology hero who laid many of the foundations that enable our present information age.
但意大利人可不是这么讲的。对意大利人来说,电话的发明者是安东尼奥·梅乌奇。这位意大利裔美国人声称自己在1854年左右发明了一种telettrofono,并不断改进设计,直到1870年,他成功地让自己的声音通过电线传播了一英里多的距离。而1876年与贝尔一起工作的沃森,只不过是在隔壁的房间里。
But that’s not how they tell it in Italy. To Italians, the inventor of the telephone is Antonio Meucci. This Italian-American claimed to have invented a telettrofono around 1854 and kept improving on his design until 1870, when he managed to propagate his voice through wire for a distance of more than a mile. Watson, working with Bell in 1876, was only in the next room.
那么伊莱沙·格雷呢?格雷于1872年创立了西部电气制造公司,为西联汇款提供电报设备。格雷利用这项技术,最终发明了可变电阻麦克风。这种装置可以对多音调的声音(例如人声)进行编码,以便通过电线传输。实际上,格雷也发明了电话。
And what about Elisha Gray? Gray founded the Western Electric Manufacturing Company in 1872, which supplied telegraphic equipment to Western Union. Fiddling with this technology, Gray ended up inventing the variable-resistance microphone. This device made it possible to encode multitonal sounds, like human voices, for transmission over a wire. In effect, Gray invented the telephone, too.
电话的发明者名单读起来就像十九世纪末创新者名录,上面罗列着许多可能发明或可能未发明电话的伟人。他们中的许多人都拥有以自己名字命名的专利,以描述他们的贡献。梅乌奇于1871年提交了一份专利警告——一种临时专利——称他的技术为“说话电报”。但这是否意味着梅乌奇理应获得荣誉?奇怪的是,几年后,他的这项专利权就失效了,从未成为正式专利。此外,梅乌奇是否真的制造出他声称制造的东西也并不完全清楚。1876年2月14日,在梅乌奇提交申请近五年后,格雷的律师来到华盛顿特区的专利局,为电话的发明提交了一份专利警告。这表明荣誉应该属于格雷。但当天早些时候,贝尔的律师也来到了同一家专利局。他申请的专利——你猜对了——就是电话的发明。
The list of great minds that may or may not have invented the telephone reads like a who’s who of late-nineteenth-century innovators. Many of them have patents in their name describing their contributions. Meucci filed a patent caveat—a sort of provisional patent—in 1871, calling his technology a speaking telegraph. But does that mean Meucci deserves the credit? Oddly, he let this claim expire some years later, and it never became a full patent. Furthermore, it’s not totally clear that Meucci ever built exactly what he claimed to have built. On February 14, 1876, nearly five years after Meucci’s filing, Gray’s lawyer entered the patent office in Washington, D.C., to file a patent caveat for the invention of the telephone. That suggests that credit ought to go to Gray. But earlier that day, Bell’s lawyer had entered the same office. He had filed a patent for—you guessed it—the invention of the telephone.
甚至不要让我们开始谈论灯泡。
Don’t even get us started on the lightbulb.
要明确无误地确定某件东西是何时发明的,是不可能的。我们需要妥协。一种办法是把电话这样的发明逐一梳理,根据证据做出最佳猜测。但这很危险:我们自身的偏见,无论是有意识的还是潜意识的,都可能影响结果。于是,阿维娃做了她能做的最明智的事情:她放弃了,转而使用维基百科。
Unambiguously determining when something was invented is impossible. We needed to compromise. One option was to go through inventions, like the telephone, one by one, and take our best guess based on the evidence. But that was dangerous. Perhaps our own biases, conscious or subconscious, would influence the results. Instead, Aviva did the smartest thing she could: She gave up and used Wikipedia.
维基百科列出了许多重大发明的日期。我们知道其中一些日期并非最佳日期。但正因为这些日期并非我们自己选定的,所以我们可以肯定,它们不会反映我们的偏见,也不太可能被系统性地歪曲,从而破坏我们的实验。有时候,相亲反而更好。
Wikipedia lists dates for numerous major inventions. We know that some of them are not the best possible dates. But because they aren’t our dates, we can be sure that they don’t reflect our biases, and that they are unlikely to be systematically skewed in a way that will undermine our experiment. Sometimes blind dates are better.
Aviva 仔细检查了每个日期,以确保其合理性——至少有一项最相关的专利是在当时提交的,并且——根据 ngram 统计——该技术在该日期之前尚未被广泛使用,无论以何种名称(例如,既不是传真机,也不是电传机)。如果日期不合理,她就把这项发明从我们的小注册表中剔除。其他的,她都保留。
Aviva checked each date to make sure it was plausible—that at least one of the most relevant patents was filed at that time, and that—per ngrams—the technology was not in wide use before that date, by any name (e.g., neither as fax machine nor as telefax). If the date wasn’t plausible, she struck the invention from our little registry. Anything else, she kept.
她留下的是一个列出 147 个伟大创意及其 147 个诞生日的清单。此清单包括各种各样很酷的小玩意。其中之一就是打字机,由查尔斯·瑟伯于 1843 年获得专利。(有趣的是,他认为打字机对“盲人……和神经质”的人来说特别有用。)另一个引人注目的条目是胸罩,由西格蒙德·林道尔于 1913 年获得专利。该清单包括分子(吗啡和硫胺素)、材料(派热克斯和胶木)、运输方式(直升机和自动扶梯)、炸毁物品的方式(炸药和机关枪)以及大量有用的小玩意(订书机、带锯、安全剃须刀)和概念(巴氏杀菌法)。就像一个好的百货商店,你会找到你需要的一切,无论你需要的是一条牛仔裤还是一个灯泡。而且——就像一家好的百货商店一样——你会发现很多你可能不需要的东西,比如缆车和石油钻机。
What she was left with was a list of 147 big ideas and their 147 birthdays. This list includes all sorts of cool gadgets. One is the typewriter, patented in 1843 by Charles Thurber. (Interestingly, he thought of it as a particularly useful aid for “the blind . . . and the nervous.”) Another upstanding entry is the brassiere, patented in 1913 by Sigmund Lindauer. The list includes molecules (morphine and thiamine), materials (Pyrex and Bakelite), methods of transportation (helicopter and escalator), ways of blowing things up (dynamite and machine gun), and a cornucopia of useful doodads (stapler, bandsaw, safety razor) and concepts (pasteurization). Like a good department store, you’ll find everything you need, whether what you need is a pair of jeans or a lightbulb. And—also like a good department store—you’ll find plenty of things that you probably don’t need, like a cable car and an oil drill.
利用这份清单,我们可以研究伟大发明的诞生故事。有些发明,比如李维·斯特劳斯的牛仔裤,故事才刚刚开始:即使在今天,它们的影响仍在不断增长。其他发明,比如玻璃纸,已经过了鼎盛时期。它们教会了我们一些东西;我们偶尔会用到它们;它们的遗产也传承给了新一代的思想。但从我们的集体记忆的角度来看,它们已经过时了。
Using this list, we could study the life stories of great inventions. In some cases, like Levi Strauss’ jeans, the tale is still just beginning: Even today, their impact continues to grow. Other inventions, like cellophane, are past their prime. They’ve taught us something; we might occasionally use them; and their legacy has been passed on to a new generation of ideas. But from the standpoint of our collective memory, they are old hat.
当然,这份发明清单最让我们兴奋的是,就像艾宾浩斯的无意义音节表一样,它能让我们洞察学习——这一次,是从整个社会的尺度来洞察。在之前的章节中,我们想知道,最有名的人往往在多大年纪才开始对文化记录产生影响。现在,让我们问同样的问题,不过是关于科技。一项发明需要多长时间才能达到其全部文化影响力的四分之一(以ngrams来衡量)?
Of course, what’s most exciting to us about this list of inventions is that, like the nonsense syllabary of Ebbinghaus, it can give us insight into learning—this time, at the scale of whole societies. In an earlier chapter, we wondered how old the most famous people tend to be when they start making an impact on the cultural record. Now let’s ask the same question, but about technology. How long does it take for a given invention to rise to one-quarter of its full cultural impact, as measured by ngrams?
以左轮手枪为例。它于1835年由塞缪尔·柯尔特获得专利。1918年,六发左轮手枪的影响力达到顶峰,其使用频率为,恰如其分地,每百万字出现六次。(这是比尔·克林顿巅峰时期的三倍。)1859年,它达到了每百万字1.5次提及——相当于四分之一的记录。1835年至1859年这二十四年的时间,让我们感受到“左轮手枪”花了多长时间才点燃了我们的集体热情。它衡量了社会对这一特定概念的理解速度。
Consider the revolver. It was patented in 1835 by Samuel Colt. In 1918, the six-shooter reached peak influence, at a frequency of, appropriately enough, six appearances in every million words. (That’s three times as high as Bill Clinton at his peak.) It reached 1.5 mentions per million—the one-quarter mark—in 1859. The length of the period between 1835 and 1859, twenty-four years, gives us a sense of how long the revolver took to fire up our collective enthusiasm. It’s a measure of how quickly society learned about that particular concept.
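The quarter-impact calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' actual pipeline: the function name, the dictionary input format, and most of the data points are assumptions. Only the 1859 and 1918 frequencies and the 1835 patent date come from the text.

```python
def years_to_quarter_impact(freq_by_year, invention_year):
    """Years from invention until frequency first reaches 25% of its peak.

    freq_by_year maps a year to mentions per million words (a hypothetical
    input format; any yearly ngram frequency series would do).
    """
    # Ignore anything recorded before the invention date.
    series = {y: f for y, f in freq_by_year.items() if y >= invention_year}
    if not series:
        raise ValueError("no data on or after the invention year")
    threshold = max(series.values()) / 4  # one-quarter of the peak frequency
    for year in sorted(series):
        if series[year] >= threshold:
            return year - invention_year


# Only the 1859 and 1918 frequencies come from the text; the other
# points are made up to fill in the shape of the curve.
revolver = {1835: 0.0, 1845: 0.6, 1859: 1.5, 1880: 3.5, 1918: 6.0, 1950: 4.0}
print(years_to_quarter_impact(revolver, 1835))  # 24
```

The peak is taken over the whole series after the invention date, so an invention whose frequency is still rising will see its quarter-impact year shift as more data arrives.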
事实证明,这个数字对于发明的差异比对于名人的差异大得多。索尼随身听于 1978 年发明,仅用了 10 年就达到了四分之一影响力的里程碑。苹果的iPod也同样受欢迎——如果你想让你的发明迅速产生巨大影响,便携式音乐播放器似乎是最佳选择。和左轮手枪一样,玻璃纸也用了大约 25 年才达到四分之一影响力的里程碑。打字机用了 45 年。而蓝色牛仔裤用了 103 年。照这个速度,施特劳斯作为一名数学家可能会产生更快的影响。
It turns out that this number varies much more for inventions than it does for celebrities. Sony’s Walkman, invented in 1978, took only a decade to reach the quarter-impact milestone. Apple’s iPod was a similar hit—if you want your invention to make a big impact fast, portable music players seem like the way to go. Like the revolver, cellophane took about a quarter of a century to reach the quarter-impact mark. The typewriter took forty-five years. And blue jeans took 103. At that rate, Strauss might have made a faster impact as a mathematician.
但这些数字——一项新技术要花一个世纪才能传播——似乎非常长。如今,新技术已经频繁地改变着我们的日常生活。这是怎么回事?集体学习会加速吗?
But these numbers—a century for a new technology to spread—seem very large. Today new technologies routinely transform our daily lives. What’s going on? Could collective learning be speeding up?
使用 ngrams,我们可以检查。
Using ngrams, we can check.
为此,我们将受艾宾浩斯启发的发明清单与安德沃德的队列研究方法相结合。我们按发明日期排列了147项技术,从1801年的提花织机开始,到1920年的早期电子乐器特雷门琴结束。然后,我们将它们分为三个时期:十九世纪初的发明、十九世纪中期的发明以及世纪之交的发明。
To do so, we combined our Ebbinghaus-inspired list of inventions with Andvord’s cohort method. We arranged our 147 technologies by date of invention, starting with the Jacquard loom (1801) and ending with the theremin, an early electronic instrument (1920). We then grouped them into three periods: inventions of the early nineteenth century, inventions of the mid–nineteenth century, and inventions from around the turn of the century.
集体学习在不同时期的差异显而易见。19世纪初的技术花了65年才达到四分之一影响力的水平。而世纪之交的发明仅用了26年。集体学习曲线越来越短,每十年缩短约2.5年。社会学习的速度越来越快。
The differences in collective learning over time were obvious. Early-nineteenth-century technologies took sixty-five years to reach the quarter-impact mark. Turn-of-the-century inventions took only twenty-six years. The collective learning curve has been getting shorter and shorter, shrinking by about 2.5 years every decade. Society is learning faster and faster.
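The cohort comparison above amounts to bucketing inventions by invention date and averaging the quarter-impact lag within each bucket. Here is a rough Python sketch under stated assumptions: the period cut points and every data point are invented for illustration, chosen only so that the early and turn-of-the-century averages match the sixty-five and twenty-six years quoted in the text.

```python
from statistics import mean

# (label, first_year, last_year): the exact cut points are an assumption;
# the text names the three periods but not their boundaries.
PERIODS = [("early", 1800, 1839), ("mid", 1840, 1879), ("turn", 1880, 1920)]


def cohort_means(records, periods=PERIODS):
    """Average lag per period for (invention_year, lag_in_years) records."""
    buckets = {label: [] for label, _, _ in periods}
    for invented, lag in records:
        for label, start, end in periods:
            if start <= invented <= end:
                buckets[label].append(lag)
                break
    # Skip empty cohorts so mean() is never called on an empty list.
    return {label: mean(lags) for label, lags in buckets.items() if lags}


# Entirely made-up toy records, not the real list of 147 inventions.
toy = [(1805, 70), (1820, 60), (1850, 50), (1860, 40), (1890, 30), (1910, 22)]
averages = cohort_means(toy)
print(averages["early"], averages["turn"])
```

With the real registry, the same grouping would be applied to all 147 (invention year, lag) pairs.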
这是为什么呢?就像集体遗忘一样,我们也不清楚。但其潜在的后果值得深思。
Why is that? As with collective forgetting, we don’t quite know. But the potential consequences are fascinating to contemplate.
我们不断缩短的集体学习曲线可能带来的最有趣的结果之一,源自物理学家斯坦尼斯拉夫·乌拉姆和博学之士约翰·冯·诺伊曼之间的一次对话。乌拉姆深谙那些影响深远的发明:他是氢弹的共同发明者。冯·诺伊曼是一位著名的数学家、物理学家、博弈论家,也是计算机科学的奠基人之一。(冯·诺伊曼还创造了“相互保证毁灭”这一说法及其缩写MAD。他们当时的对话一定非常精彩。)尽管冯·诺伊曼无法精确量化,但他感觉到技术进步的速度正在加快。在与乌拉姆的对话中,他观察到:
One of the most intriguing possible outcomes of our ever-shrinking collective learning curve emerged from a conversation between the physicist Stanislaw Ulam and the polymath John von Neumann. Ulam was a man who knew about inventions that make a big impact: He co-invented the hydrogen bomb. Von Neumann was a famous mathematician, physicist, and game theorist, and a founding father of computer science. (Von Neumann also coined the phrase Mutually Assured Destruction and its acronym, MAD. Their conversations must have been fascinating.) Despite his inability to precisely quantify it, von Neumann sensed that the rate of technological advancement was increasing. In conversation with Ulam, he observed:
技术的不断进步和人类生活方式的变化……似乎正在接近人类历史上的某个本质奇点,超越这个奇点,我们所知的人类事务就无法继续下去。
The ever accelerating progress of technology and changes in the mode of human life . . . gives the appearance of approaching some essential singularity in the history of the race beyond which human affairs, as we know them, could not continue.
这个想法因未来学家雷·库兹韦尔(Ray Kurzweil)而广为人知。他指出,计算机芯片性能提升的速度(即著名的摩尔定律)表明,到2045年,一台普通计算机的处理能力将超过所有人类大脑的总和。他预测,到那时,我们只需将思维下载到磁盘上,就能在机器中永存。这就是库兹韦尔所说的技术奇点。
This idea was popularized by futurist Ray Kurzweil, who noted that the rate at which computer chips were getting more powerful—a famous regularity known as Moore’s law—suggests that, by 2045, an ordinary computer will have more processing power than all of mankind’s brains put together. At that point, he predicts that it will be possible to just download our thoughts onto a disk, and to live forever among the machines. This is what Kurzweil refers to as the technological singularity.
这听起来可能有点奇怪,但库兹韦尔并非疯子。他在麻省理工学院读书时就卖掉了自己的第一家公司,并发明了众多被广泛应用的技术。比尔·盖茨称库兹韦尔是“我认识的预测人工智能未来最好的人”,《福布斯》杂志则称他为“终极思维机器”。2001年,他获得了奖金50万美元的莱默尔森-麻省理工学院奖(全球奖金最高的发明家奖项),还获得了比尔·克林顿颁发的国家技术奖章,而克林顿的名气比你沙拉里的大部分配料都要大。所以,库兹韦尔无疑是行家。但他说得对吗?
This may seem like a strange concept, but Kurzweil is no loony. He sold his first company while a student at MIT and has invented numerous widely used technologies. Bill Gates called Kurzweil “the best person I know at predicting the future of artificial intelligence,” and Forbes branded him “the ultimate thinking machine.” He was awarded the $500,000 Lemelson-MIT Prize in 2001—the world’s largest prize for inventors—as well as a National Medal of Technology from Bill Clinton, a man more famous than most of the ingredients in your salad. So there’s no doubt that Kurzweil knows his stuff. But is he right?
我们真的不知道。Ngrams 告诉我们过去的事情。可惜的是,它们无法预测未来。至少目前如此。
We really don’t know. Ngrams tell us about the past. Alas, they do not predict the future. Yet.
我们对记忆的粗略测量表明,维也纳学派一个世纪前认为不可能实现的事情是可以实现的:通过对集体意识和集体记忆进行实证测量,来量化人民的精神,即Volksgeist。
Our crude measures of memory suggest that it’s possible to achieve what the Vienna Circle had thought impossible a century ago: to quantify the spirit of the people, the Volksgeist, by empirically measuring aspects of collective consciousness and collective memory.
但我们没有告诉你的是,这是一项非常危险的尝试。
But what we didn’t tell you is that this is a very dangerous endeavor.
“民族精神”(Volksgeist)并非一个无害的概念。它是由德国哲学家约翰·戈特弗里德·赫尔德在十八世纪相当无心地提出的。赫尔德本人秉持多元主义,反对奴隶制、殖民主义,以及种族之间存在根本生物学差异的观念。他认为民族之间存在差异,这些差异构成了他所谓的“民族精神”,但他并不认为这些差异有优劣之分。
Volksgeist is not an innocuous concept. It was introduced, rather innocently, by the German philosopher Johann Gottfried Herder in the eighteenth century. Herder himself was very pluralistic, rejecting slavery, colonialism, and the notion that there were fundamental biological differences between the races. He believed that there were differences between nations—differences that formed what he called Volksgeist—but he didn’t think that they were a matter of superiority or inferiority.
然而,如果你将“民族精神”的概念与极度活跃的民族主义混合在一起,就很容易看出赫尔德的想法如何成为种族主义的遮羞布:我优越,因为我的人民拥有更好的“民族精神”。
Yet if you mix the notion of Volksgeist with hyperactive nationalism, it’s easy to see how Herder’s idea can become a fig leaf for racism: I’m superior, because my people have better Volksgeist.
在某些情况下,事情正是如此。回想一下那些导致德国各地焚书的十二条论纲中学生们的主张。他们“希望尊重人民(Volk)的传统”,清除一切体现非德意志精神(undeutschen Geist)的东西。在十九世纪和二十世纪,每当涉及种族主义问题时,“民族精神”这个概念总是近在咫尺。
In some cases, this is exactly what happened. Think back to what the students claimed in those twelve theses that led to book burnings all over Germany. They “want to respect the traditions of the Volk” by eliminating anything that reflected an un-German spirit: undeutschen Geist. When it came to matters of racism in the nineteenth and twentieth centuries, the concept of Volksgeist was never far to seek.
但也有更健康的方式来对待“民族精神”。这位德裔美国知识分子弗朗茨·博厄斯,常被称为现代人类学之父,在他的著作中也借鉴了“民族精神”(Volksgeist)这一概念。但他断然拒绝将“民族精神”与极端民族主义意识形态相融合,认为这种危险的混合体是一种在智力和道德上都贫乏的方法。
But there are healthier approaches to Volksgeist, too. The German-American intellectual Franz Boas, often called the father of modern anthropology, drew on the very same notion of Volksgeist in his work. But he categorically rejected the blending of Volksgeist and ultranationalist ideologies, recognizing this dangerous concoction as an intellectually and morally impoverished approach.
相反,他试图将“民族精神”与激发艾宾浩斯的那种经验主义态度相结合。对博厄斯而言,文化瞬息万变,但始终易于观察和经验描述。通过整合这两种传统,博厄斯为文化的科学研究奠定了基础,创造了我们今天所说的人类学。
Instead, he tried to synthesize Volksgeist with the kind of empirical attitude that had motivated Ebbinghaus. To Boas, culture was ever changing but always susceptible to observation and empirical description. By uniting these two traditions, Boas laid the groundwork for the scientific study of culture, creating what we call anthropology today.
正是考虑到博厄斯,当我们与科学家交谈时,我们喜欢将我们所做的事情称为“文化组学”。
It is with Boas in mind that, when speaking to scientists, we like to call what we do “culturomics.”
-omics表示大数据,这个后缀在现代生物学及其他领域中都具有这一含义。
The -omics denotes big data, which is what that suffix has come to imply in modern biology and beyond.
这种文化是博厄斯的文化:可以通过经验了解,其巨大的变化引发了无尽的好奇心和真正的庆祝。
The culture is the culture of Boas: empirically knowable, its vast variations a matter of endless curiosity and genuine celebration.
2010年。在哈佛大学进化动力学项目的一个昏暗的房间里,一个电脑机箱敞开着放在桌子上。Yuan 刚从谷歌剑桥办公室过来,带来了存有 ngram 数据的硬盘。结果几个小时前才完成编译。我们插上硬盘,启动机器,迫不及待地想确认,三年过去了,我们终于得到了我们以为得到的东西。我们三个人等待电脑启动,房间里唯一的声音就是磁盘旋转的嗡嗡声,令人安心。
2010. In a darkened room at Harvard’s Program for Evolutionary Dynamics, a computer chassis stood on a desk, open. Yuan had just come over from Google’s Cambridge office, bringing with him hard disks containing the ngram data. The results had finished compiling only hours before. We plugged them in and turned the machine on, eager to confirm that, after three years, we finally had what we thought we had. As the three of us waited for the computer to boot, the only sound in the room was the reassuring whir of the spinning disks.
最后,命令提示符。
At last, a command prompt.
从哪里开始?
Where to start?
进化——正是它让我们来到这里。
Evolution—it’s what had gotten us here.
又是一阵嗡嗡声;一分钟过去了;又敲了几下键盘;突然,命令提示符被一张图表取代了。透过这条柔和起伏的曲线,数以百万计的声音跨越几个世纪向我们诉说。这条曲线从浩瀚的数据海洋中汲取灵感,提炼出一个简单却有力、任何人都能理解的故事。
Again, the whir; a minute passed; some more keystrokes followed; and suddenly the command prompt was replaced with a chart. There, through the soft, undulating line, millions of voices spoke to us through the centuries. Drawing from an ocean of data, the curve had distilled a simple, powerful story that anyone could understand.
我们低声表示赞同。这确实是进化。
We murmured our approval. Evolution indeed.
接下来的声音是“砰”的一声:软木塞弹了出来。
The next sound was a pop: the cork.
有一次,我们试图说服谷歌的某个人:开发一个用于研究 ngram 的公共工具(我们提议叫它 Bookworm)是个好主意。他很快就给我们泼了冷水,回答说:“谁会用它?教授们。现在,假设世界上每个教授都用 Bookworm。那就是,比方说,十万人。在谷歌,十万用户根本算不上什么。”
Once, we had tried to convince someone at Google that building a public tool for studying ngrams, which we proposed calling Bookworm, was a good idea. He quickly took us down a peg or two, responding, “Who’s going to use it? Professors. Now, suppose every single professor in the world uses Bookworm. That’s, say, one hundred thousand people. At Google, a hundred thousand users doesn’t even move the needle.”
我们很难对此提出异议。
It was hard for us to argue with that.
但一旦我们拿到数据,开始研究,就开始注意到一些奇怪的事情:ngram 正在接管我们的生活。我们根本停不下来。我们从进化论开始研究。但不规则动词呢?总统呢?爱因斯坦呢?在鸡尾酒会上,有人会问这样的问题:人们什么时候开始使用“性别歧视”这个词?笔记本电脑会弹出:70 年代初。人们什么时候开始把“doughnut”写成“donut”?笔记本电脑又会弹出:50 年代,就在 Dunkin' Donuts 创立之后。
But once we got the data and started playing with it, we began to notice something odd: The ngrams were taking over our lives. It was impossible to stop looking at them. We had started with evolution. But how about irregular verbs? How about presidents? How about Einstein? At cocktail parties, someone would ask something like: When did people start using the term sexism? Out pops a laptop: the early ’70s. When did people start writing donut instead of doughnut? Out comes that laptop again: in the ’50s, right after the founding of Dunkin’ Donuts.
我们开始开会,目标是写一篇科学论文来描述我们最有趣的发现。我们觉得,如果写成论文,就能帮助我们继续前进。但每次我们开始写一个主题,就会被一组新的ngrams彻底分散注意力。零食!公司!恐龙!会议结束时,我们意识到,与最新的令人大开眼界的发现相比,我们自认为最有趣的发现显得枯燥乏味。这简直是无解的局面。我们该如何戒掉这个瘾呢?
We started having meetings with the goal of writing a scientific paper to describe our most interesting findings. If we wrote a paper, we thought, it would help us move on. But each time we started writing about one topic, we would get hopelessly distracted by a new set of ngrams. Snack foods! Companies! Dinosaurs! By the meeting’s end, we realized that what we thought were our most interesting findings were boring in comparison to the latest eye-opener. It was an impossible situation. How could we manage to break our addiction?
我们需要休息一下,给自己时间整理思绪。于是,我们拿出了四台可以访问ngram数据库的笔记本电脑(世界上仅有的四台可以运行我们Bookworm原型界面的笔记本电脑),并开始把它们送给其他人。其中一台给了平克,他很快就开始寻找图表,用于他正在写的书中。另一台给了埃雷兹的妻子阿维娃。她立即报告了新的发现:在德语ngram中查找门德尔松,使她开始追踪审查制度。现在她也上瘾了。
We needed to take a breather, to give ourselves time to gather our thoughts. So we took the four laptops that had access to the ngram database—the only four laptops in the world that could run our prototype Bookworm interface—and started giving them away to other people. One went to Pinker, who quickly began finding charts to include in the book he was writing. Another went to Aviva, Erez’s wife. Immediately she reported new discoveries: Checking the German ngram for Mendelssohn had led her to start tracking censorship. Now she was addicted, too.
第三台机器给了马丁·诺瓦克。回家后,他漫不经心地把“书虫”给当时16岁的儿子塞巴斯蒂安看。塞巴斯蒂安输入了一个查询,弹出了一个图表。他好奇地又试了一次;输入了两个查询后,他从马丁手中拿走了机器,然后告辞了。又过了十分钟,他打电话给一个朋友:“你一定要过来看看。”朋友来了,两人又一次输入查询,直到深夜。
A third machine went to Martin Nowak. When he got home, he casually showed Bookworm to his son Sebastian, who was sixteen years old at the time. Sebastian typed in a query. A chart popped up. Intrigued, he tried another; two queries in, he took the machine away from Martin and excused himself. After ten minutes more, he called a friend: “You have to come over and see this.” The friend came over, and the two typed in query after query late into the night.
最后一台机器被送往了谷歌2010年图书馆峰会,我们受邀在那里发表主题演讲。峰会上,谷歌通常会向世界各地的图书馆负责人披露其数字化项目的最新消息。
The last machine went to Google’s 2010 Library Summit, where we had been invited to give a keynote address. The summit was where Google typically disclosed the latest news about its digitization project to the heads of many world libraries.
你可能会觉得图书管理员是那种安静的人。但我们的经历并非如此。
Now, you would think librarians are the quiet type. That was not our experience.
当我们解释我们所做工作的基本概念时,热情开始高涨:没有人听说过这样的事情,更没有人听说过如此规模的事情。挤满了人的演讲厅里,每个人都全神贯注地听着。当我们开始展示几个例子时,房间里的气氛已经异常热烈。终于,在45分钟后,我们停止了演讲,启动了Bookworm。我们问观众:“有什么想查询的吗?”迎接我们的是雷鸣般的掌声,那是我们此前此后都从未听到过的。但在掌声之上,你能听到图书管理员们按捺不住,开始大声喊叫:
As we explained the basic concept of what we were doing, enthusiasm began to mount—no one had ever heard of anything like this, certainly not at this scale. We had the full attention of every single person in the packed lecture hall. By the time we started showing a few examples, the energy in the room was extraordinary. Finally, after forty-five minutes, we stopped talking and booted up Bookworm. We asked the audience, “Any queries?” We were greeted with thunderous applause, the likes of which we have never heard before or since. But over it, you could hear the librarians begin to shout, unable to contain themselves:
“试试他和她!”
“Try he versus she!”
“输入全球变暖!”
“Type in global warming!”
“海盗大战忍者!”
“Pirates versus ninjas!”
房间里充满了兴奋、好奇、欢乐和极度着迷的气氛。
The room exploded with excitement, curiosity, glee, and utter fascination.
这些 ngram 令人着迷,让人无法抗拒,而且完全让人上瘾。就好像我们发现了一种全新且极其古怪的海洛因。
The ngrams were spellbinding, irresistible, and totally addictive. It was as though we’d discovered a new and extremely nerdy form of heroin.
坐在前排的丹·克兰西(Dan Clancy)看得出,我们精心设计的这个奇特的小玩意儿,对谷歌用户来说,一定会像对我们和图书管理员一样充满乐趣。他宣布:谷歌将改编我们的原型,并将其作为谷歌图书的一部分推出。我们非常激动。
Sitting in the front row, Dan Clancy could see that the odd little gizmo we had cooked up was going to be as much fun for Google’s users as it had been for us and the librarians. He gave the word: Google was going to adapt our prototype and launch it as a part of Google Books. We were thrilled.
突然之间,我们的项目从一只循规蹈矩、科学严谨的乌龟变成了一只谷歌驱动的兔子。短短两周内,谷歌的优秀工程师乔恩·奥万特、马修·格雷和威廉·布罗克曼就打造出了一个令人惊叹的网页版“书虫”。为了避免冗长的内部商标审批流程,我们不得不放弃这个名字。我们给它取了一个简洁、专业的标签:Ngram Viewer。2010年12月16日下午2点,《科学》杂志发表了我们的研究文章,与此同时,谷歌推出了Ngram Viewer。
Suddenly, our project was transformed from a methodical, scientific tortoise into a Google-powered hare. In two weeks flat, the amazing Google engineers Jon Orwant, Matthew Gray, and William Brockman built a stunning, Web-based version of Bookworm. To avoid the lengthy internal process for approving new trademarks, we had to ditch the name. We gave it a simple, technical label instead: the Ngram Viewer. At 2:00 p.m. on December 16, 2010, the journal Science published our research article, and simultaneously, Google launched the Ngram Viewer.
仅在最初的 24 小时内,该网站就获得了 300 万次点击。互联网上一片沸腾,Twitter 上更是人声鼎沸,对 Ngram Viewer 的评价五花八门,从“令人上瘾”(@gbilder)到“完全上瘾”(@paulfroberts),再到“我的天哪,谷歌 ngram viewer 是史上最让人上瘾的工具”(@rachsyme)。《琼斯母亲》杂志称赞它“或许是互联网历史上最浪费时间的东西”。第二天早上,我们拿起《纽约时报》,惊讶地发现我们的研究上了头版。
In the first twenty-four hours alone, the site got three million hits. The interwebs were atwitter, and the Twitter was abuzz, with reviews of the Ngram Viewer ranging from “addictive” (@gbilder) to “totally addictive” (@paulfroberts) to “Ohmygoodness the google ngram viewer is the most addictive tool ever” (@rachsyme). Mother Jones hailed it as “perhaps the greatest timewaster in the history of the Internet.” When we picked up a copy of the New York Times the next morning, we were surprised to find our work on the front page.
问题解决了:如果我们不能摆脱对 ngrams 的沉迷,至少我们可以让世界其他地方也跟着我们一起沉迷。
Problem solved: If we couldn’t break our paralyzing addiction to ngrams, at least we could take the rest of the world down with us.
1610年9月,伽利略开始对火星进行一系列观测。到了同年12月,他注意到一个惊人的现象:火星似乎越来越小,现在只有9月份大小的三分之一。伽利略得出结论,在短短几个月的时间里,火星已经离地球远了许多。这是一个关键的证据,证明地球并非处于宇宙的中心。但除此之外,伽利略几乎什么也看不见。他的望远镜太原始,无法分辨行星表面的任何细节。
In September 1610, Galileo began a series of observations of the planet Mars. By December of that year, he noticed something remarkable: Mars appeared to be getting smaller and smaller, and was now only a third of its September size. Galileo concluded that, over a period of only a few months, the planet had moved much, much farther from the Earth—a crucial piece of evidence that the Earth was not at the center of the universe. But beyond that, Galileo couldn’t see much. His telescope was too primitive to resolve anything about the planet’s surface.
几个世纪后,乔瓦尼·斯基亚帕雷利将一架威力强大得多的望远镜对准了这颗红色星球。他看到的景象令人叹为观止:行星表面仿佛蚀刻着巨大的线条。斯基亚帕雷利的发现让一位名叫珀西瓦尔·洛厄尔的人兴奋不已,1894年,洛厄尔决定建造一架望远镜亲自观测。在他创办的位于亚利桑那州弗拉格斯塔夫的天文台,洛厄尔也看到了这些线条。洛厄尔天文台的许多成员证实了他的发现。基于这些直接观测,团队绘制了精细的地图,显示这些线条形成了一个纵横交错的密集网络,覆盖了整个星球。
Some centuries later, Giovanni Schiaparelli pointed a far more powerful telescope at the red planet. What he saw was remarkable: Massive lines appeared etched into the planet’s surface. Schiaparelli’s findings so excited a man named Percival Lowell that, in 1894, Lowell decided to build a scope to see for himself. At the observatory he founded in Flagstaff, Arizona, Lowell saw the lines, too. Many members of Lowell’s observatory confirmed his findings. On the basis of these direct observations, the team made meticulous maps, showing that the lines formed a dense network crisscrossing the planet.
火星表面这些巨大的特征是什么?
What could these gargantuan features on the Martian surface be?
洛厄尔的解释基于一个世纪前就已广为人知的认知:火星除了两极的冰盖外,几乎没有水。洛厄尔认为,这些线条其实是一个巨大的运河网络,是一颗垂死星球上的居民挖掘的灌溉系统,目的是利用极地地区的水来补充水分。根据他通过望远镜看到的线条,洛厄尔得出结论:火星上存在着智慧生命。我们并非孤独的存在。
Lowell’s explanation hinged on the knowledge, already widespread a century ago, that Mars had little water except in the form of ice caps at the planet’s poles. Lowell argued that the lines were a vast network of canals, an irrigation system dug by the inhabitants of a dying planet in order to rehydrate their world using water from its polar regions. Based on the lines he saw through the telescope, Lowell concluded that Mars was home to intelligent life. We were not alone.
在科学家中,关于洛厄尔工作的争论异常激烈。许多人持怀疑态度,但也有一些人热情高涨。被誉为美国天文学界泰斗的亨利·诺里斯·罗素在谈到火星运河时说:“现存的理论中,也许最好的,当然也是最能激发想象力的,是洛厄尔先生和他的同事在亚利桑那州天文台提出的理论。”
Among scientists, the arguments about Lowell’s work could not have been more heated. Many were skeptical. But some were enthusiastic. Henry Norris Russell, the so-called dean of American astronomers, said of the Martian canals that “perhaps the best of the existing theories, and certainly the most stimulating to the imagination, is that proposed by Mr. Lowell and his fellow workers at his observatory in Arizona.”
洛厄尔那些令人振奋的想法的影响远远超出了科学界。这些想法通过一系列三本书广为传播,席卷了全世界。令人屏息的新闻报道接踵而至。一位观察者甚至在洛厄尔的运河网络中发现了由三个字母组成的希伯来语神名:Shadai。到1898年,赫伯特·乔治·威尔斯已经写出了《世界大战》。早在洛厄尔的发现尘埃落定之前,火星人就已经占领了地球,或者至少占领了地球人的想象。
Lowell’s electrifying ideas had an impact far beyond scientific circles. Popularized by a series of three books, they took the world by storm. The breathless news reports came fast and furious. One observer even discovered, embedded in Lowell’s canal network, the three-letter Hebrew name of God: Shadai. By 1898, H. G. Wells had written The War of the Worlds. Long before the dust settled on Lowell’s discoveries, Martians had taken over the Earth—or at least, its imagination.
到了20世纪10年代,随着望远镜观测技术的进步,科学界对洛厄尔想法的热情逐渐消退。然而,一个想法的“半衰期”很长,尤其是像这种充满趣味的想法,洛厄尔的观点和灌溉地图仍然影响深远。当美国国家航空航天局(NASA)发射第一批无人探测器拍摄这颗红色星球的照片时,用于规划此次任务的火星地球仪上精心标注了洛厄尔运河网络的标记。1964年,随着水手号探测器飞速穿越太空抵达目的地,人们对火星生命的兴奋再次达到了高潮。
Scientific enthusiasm for Lowell’s ideas had ebbed by the 1910s, in the light of better observations through better telescopes. Still, the half-life of an idea is long, especially such a fun idea, and Lowell’s opinions and irrigation maps remained influential. When NASA sent the first unmanned probes to take pictures of the red planet, the Martian globe used to plan the mission was carefully annotated with markings that showed Lowell’s canal network. In 1964, as the Mariner probes hurtled through space to their destination, excitement about life on Mars yet again reached a fever pitch.
水手4号首次飞掠这颗行星时发回的照片令人大失所望。没有运河,没有上帝的名字,没有明显的智慧生命迹象,甚至连一条洛厄尔的线路都没有。能看到的只有一片广袤的荒凉红土,偶尔点缀着几个陨石坑。
The pictures that Mariner 4 sent back on its first flyby of the planet could not have been a greater letdown. There were no canals. No name of God. No obvious signs of intelligent life. Not a single one of Lowell’s lines. All that could be seen was a vast expanse of desolate red soil, interrupted by the occasional crater.
新型望远镜的巨大潜力在于它能带我们探索未知的世界。但新型望远镜的巨大危险在于,我们过于热情地将眼见转化为心中所期盼。即使是最有力的数据,也会屈服于解读者的掌控。火星人并非来自火星:他们来自一个名叫珀西瓦尔·洛厄尔的人的头脑。
The great promise of a new scope is that it can take us to uncharted worlds. But the great danger of a new scope is that, in our enthusiasm, we too quickly pass from what our eyes see to what our mind’s eye hopes to see. Even the most powerful data yields to the sovereignty of its interpreter. Martians didn’t come from Mars: They came from the mind of a man named Percival Lowell.
透过望远镜,我们看见自己。每一个新的镜头,也是一面新的镜子。
Through our scopes, we see ourselves. Every new lens is also a new mirror.
乌托邦、反乌托邦和 DAT(A) 乌托邦
UTOPIA, DYSTOPIA, AND DAT(A)TOPIA
在《撒母耳记》中,以色列王大卫想知道自己麾下有多少人。他下令进行人口普查。九个月后,他得到了结果:130万身强力壮的战士。但大卫的统计激怒了上帝,上帝降下瘟疫于全国。几千年来,像大卫这样的人一直试图量化他们社会的各个方面。这可能是一项危险的任务。
In the Book of Samuel, the Israelite king David wonders how many people are under his command. He orders a census. Nine months later, he gets the result: 1.3 million able-bodied fighting men. But David’s count angers the Lord, who brings a plague upon the land. For thousands of years, people like David have attempted to quantify aspects of their society. It can be a perilous undertaking.
在本书中,我们见证了数字历史记录如何以前所未有的方式量化人类群体。如今,我们不再只是数羊或数人头。相反,我们能够进行细致的测量,探究我们历史、语言和文化的重要方面。我们展示的简单图表仅仅是巨大冰山的一角。在未来几十年,个人记录、数字记录和历史记录将彻底改变我们看待自身和周围世界的方式。在结束之前,我们想勾勒这一切的未来走向,以及它对科学、学术和地平线上正在招手的量化社会意味着什么。
In this book, we’ve seen how digital historical records are making it possible to quantify our human collective as never before. Today, we are no longer just counting sheep or counting heads. Instead, we are able to make careful measurements that probe important aspects of our history, language, and culture. And the simple charts we’ve shown are merely the tip of a vast iceberg. In the coming decades, personal, digital, and historical records are going to totally transform the way we think about ourselves and about the world around us. Before we leave you, we want to sketch out where all this is going and what it’s going to mean for science, scholarship, and the quantified society that beckons on the horizon.
最后,我们来简短地思考一下最后一个问题:这一切真的好吗?大数据会成为我们未来的福音之地吗?还是说,我们未来几年做出的决定,会反过来困扰我们?
And we will grapple, too briefly, with a final question: Is all this a good thing? Will big data turn out to be a promised land? Or could the decisions we make in the coming years come back to plague us?
我们之前提到的 ngram 数据来自数百万本书籍。以当代标准来看,这无疑是大数据。但多年以后,当我们回首往事时,我们可能会有不同的看法。毕竟,几百万本书只是我们浩瀚文化产出的一小部分。
The ngram data we’ve told you about is derived from millions of books. By contemporary standards, that’s certainly big data. But when we look back, years from now, we might think differently. After all, a couple million books is just a tiny fraction of our vast cultural output.
以埃德加·爱伦·坡这样的历史人物为例。与许多早期作家不同,坡努力仅靠写作谋生。但由于缺乏国际版权法,十九世纪的作家很难谋生。迫于经济压力,坡尽可能地在各种平台上以各种形式发表作品。他创作诗歌、短篇小说、书籍、戏剧、长篇小说、评论、报纸文章、散文和书信。他甚至编造了一个关于乘热气球横跨大西洋的荒诞故事,并设法将其发表在《纽约太阳报》的特刊上。
Consider a historical figure like Edgar Allan Poe. Unlike many earlier writers, Poe strove to support himself solely via writing. But in the absence of international copyright law, it was hard for a nineteenth-century author to make a living. Driven by pressing financial needs, Poe published his works wherever he could, in an extraordinary array of forums and formats. He wrote poems, short stories, books, plays, novels, reviews, newspaper articles, essays, and letters. He even fabricated a tall tale about a transatlantic balloon voyage, which he managed to get published in a special edition of the New York Sun.
当我们思考历史记录的未来,以及数字化将如何改变它时,坡的作品就像一份待办事项清单。他的哪些作品已经进入了数字共享?它们是如何实现的?其余部分又如何?这些问题将带领我们对现存的历史记录进行一次简短而快速的游览。
When we think of the future of the historical record, and of how digitization will transform it, Poe’s works read like a to-do list. Which parts of his oeuvre have made it to the digital commons? How did they get there? And what about the rest? These questions will lead us on a brief, whirlwind tour of the historical record as it exists today.
书籍。我们的小望远镜 Ngram Viewer 最初仅覆盖了所有已出版书籍的 4%,即每二十五本书中的一本。2012 年,我们帮助 Yuri Lin、Slav Petrov 和谷歌的其他员工升级了 Ngram Viewer,使其覆盖范围达到所有书籍的 6% 左右,即每十七本书中的一本。当然,我们只使用了谷歌已数字化的全部书籍中的一部分。如果把这三千万本都算上,则占总数的20%多一点。剩下的80%怎么办?它们什么时候才能被收录到数字档案里?
Books. Our little scope, the Ngram Viewer, was initially powered by 4 percent of all books ever published, or one in twenty-five. In 2012, we helped Yuri Lin, Slav Petrov, and others at Google upgrade the Ngram Viewer to cover about 6 percent of all books, or one in seventeen. Of course, we use only a subset of all the books that Google has digitized. If you include all thirty million of those books, it comes out to a little more than 20 percent of the total. What about the remaining 80 percent? When will they make it into digital archives?
越来越多的新书从一开始就以数字形式存在,从出版之日起就以电子书的形式发行。由于当今图书出版数量远超人类历史上任何时期,这意味着以数字形式存在的图书比例正在日益快速增长。
Conveniently, an increasingly large fraction of new books are born digital, distributed as e-books from the moment of publication. Since books are being published in far larger numbers now than at any point in human history, this means that the fraction of books existing in digital form is growing rapidly with each passing day.
但我们仍然保留着一些老书,它们现在只以实体形式存在,这多少有些不便。数字化工作的大部分重点将放在这些老书上。私营企业和政府正在加紧努力,既要保护我们的集体遗产,又要从中获利。谷歌继续引领着这项工作。在现存的1.3亿册图书中,谷歌已经完成了超过3000万册的数字化。该公司预测,到2020年,剩余的图书将全部完成数字化。未来,绝大多数现存的图书很可能很快就会以数字形式记录下来。
That still leaves us with older books, which, somewhat inconveniently, only exist as physical objects. This is where most of the digitization effort will be concentrated. Private corporations and governments are stepping up to the plate, motivated by the desire to both preserve our collective heritage and profit from it. Google continues to lead this effort. It has already digitized more than 30 million of the 130 million books in existence. The company forecasts that it will be done with the rest by 2020. In all likelihood, the vast majority of surviving books will soon be recorded in digital form.
从数量上看,图书记录的覆盖率从 4% 提高到 100%,提升了 25 倍,这将对我们用文化望远镜所能进行的观测类型产生重大影响。再想想伽利略,他用一架仅比肉眼强三十倍的望远镜,就把地球从宇宙中心的宝座上踢了下去。
From a quantitative standpoint, this twenty-five-fold improvement in our coverage of the book record, from 4 percent to 100 percent, will make a big difference in terms of the kinds of observations we can make with a cultural telescope. Think again of Galileo, who kicked the Earth out of its perch at the center of the universe with a telescope that was only thirty times better than the naked eye.
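The coverage figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the numbers given in the text (4 percent, 6 percent, 30 million digitized books, 130 million books in existence); the variable and function names are our own and purely illustrative.

```python
# Back-of-the-envelope check of the book-coverage figures quoted in the text.
TOTAL_BOOKS = 130_000_000      # books in existence, per the text
DIGITIZED_BOOKS = 30_000_000   # books Google has digitized, per the text


def one_in(fraction):
    """Express a coverage fraction as 'one book in N'."""
    return 1 / fraction


initial_coverage = 0.04    # Ngram Viewer at launch
upgraded_coverage = 0.06   # Ngram Viewer after the 2012 upgrade
digitized_share = DIGITIZED_BOOKS / TOTAL_BOOKS
fold_gain = 1.0 / initial_coverage  # improvement from 4% to full coverage

print(f"4% is about one book in {one_in(initial_coverage):.0f}")
print(f"6% is about one book in {one_in(upgraded_coverage):.0f}")
print(f"Digitized share of all books: {digitized_share:.1%}")
print(f"Going from 4% to 100% is a {fold_gain:.0f}-fold gain")
```

Thirty million out of 130 million comes to roughly 23 percent, which matches "a little more than 20 percent," and 6 percent rounds to one book in seventeen.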
尽管如此,我们的书籍记录研究仍面临重大障碍。
Despite this, the study of our book record faces major hurdles.
一个严重的障碍是版权法。如今的版权立法比爱伦·坡时代任何时候都更具侵略性,而且同样过时,使该领域陷入困境。1998 年的《版权期限延长法》就是一个很好的例子。该法案将版权期限延长至作者去世后 70 年。它实际上禁止了几乎所有 1923 年后出版的书籍的在线传播,并且没有为数字研究或数字图书馆的兴起做出任何规定。互联网档案馆、HathiTrust 和古腾堡计划等组织正在努力使书籍尽可能公开可用。但由于版权立法的现状,他们对上个世纪出版的作品无能为力。
One serious impediment is copyright law. More aggressive today than it ever was in Poe’s time, and just as obsolete, copyright legislation has left the field hamstrung. The Copyright Term Extension Act of 1998 is a good example. This act extended copyright for seventy years after the author’s death. It effectively prohibited the online dissemination of nearly all books published after 1923, and it made no provisions for digital research or for the rise of digital libraries. Organizations like the Internet Archive, the HathiTrust, and Project Gutenberg are striving to make books available as openly as possible. But because of the state of copyright legislation, they can do very little about works published in the last century.
这影响了我们信息生态系统的其他部分。例如,我们的研究小组“文化观察站”创建了一些开源工具,它们比 Ngram Viewer 强大得多,能够以各种方式对书籍记录进行切片和细分。我们可以立即绘制出“raven”一词在美国各地、在三十多岁男性所写的叙事诗作品中的使用情况。但只能截至 1923 年。至于上个世纪,除非有新的法律允许进入,否则那位始终守在我们门口的律师——身穿黑袍的哨兵——仍会低声说:“永不复返!”
This affects the rest of our information ecosystem. For instance, our research group, the Cultural Observatory, has created open-source tools that are far more powerful than the Ngram Viewer, capable of slicing and dicing the book record in all sorts of ways. We can instantly map the usage of the word raven across the United States, in works of narrative poetry, written by men in their thirties. But only up to 1923. When it comes to the last century, save if new law affords entry, then the lawyer—dark-robed sentry—who is ever at our door, will yet whisper, “Nevermore!”
书籍还面临着另一个更为隐蔽的危险。随着数字书籍和数字信息变得越来越重要,实体书的生存正在多个方面受到威胁。在推出 Kindle 电子书阅读器平台仅三年后,亚马逊 Kindle 电子书的销量就开始超过纸质书。而且不仅仅是亚马逊:近年来,各种平台和零售商都出现了向电子书的强劲转变。从长远来看,像《圣经》这样具有重要意义和情感价值的文本肯定会继续印刷。但这样的文本毕竟是少数。对于这种 Zipfian 分布的长尾而言,书籍印刷将步不规则动词的后尘。几年后,像本书这样的书籍将不再印刷。
And there’s another, far more insidious danger that books face. As digital books and digital information become increasingly important, the survival of physical books is being threatened on several fronts. Only three years after introducing the Kindle e-book reader platform, sales of Kindle books at Amazon began to outstrip the sale of printed volumes. And it’s not just Amazon: There has been a compelling shift toward e-books in recent years, across a variety of platforms and retailers. In the long run, texts of great importance and sentimental value, like the Bible, will surely remain in print. But such texts are few. For the long tail of this Zipfian distribution, book printing will go the way of the irregular verb. In a few years, books like this one will no longer be printed.
纸质书籍曾经的堡垒——图书馆,如今也正面临威胁。几千年来,图书馆一直是保存历史记录的最重要机构。然而,即使在线图书馆继续取得长足进步,传统的实体图书馆却面临着大幅削减开支的困境。近年来,60% 的图书馆预算持平或下降。资金紧张,空间更加紧张,图书馆不得不处理旧书,腾出空间存放新书。问题在于,图书馆不能轻易地将旧书送人。为了防止书籍被盗,书籍上安装的追踪装置总能引导好心人找到书籍并立即归还。移除这些追踪装置的成本太高。于是,图书馆经常选择做一些我们可能认为难以想象的事情:秘密销毁书籍。这种情况的发生规模令人震惊。大型图书馆有时一次就能销毁数十万册书籍。
Physical books are also being threatened in what used to be their citadel: the library. The library has, for thousands of years, been the single most important institution working to preserve the historical record. Yet even as online libraries continue to make great strides, their traditional, brick-and-mortar counterparts are facing significant cutbacks. In recent years, 60 percent have faced flat or declining budgets. With funds tight and space even tighter, libraries have no choice but to get rid of old books to make room for new ones. The trouble is that libraries can’t just give their old books away. The tracking equipment that is installed in books to keep them from being stolen invariably leads kind souls to find the books and bring them right back. Removing these trackers is too expensive. Instead, libraries are routinely choosing to do something that we might have thought was unimaginable: They are secretly destroying books. This is happening at an astonishing scale. Large libraries sometimes dispose of hundreds of thousands of books at a time.
哪些书会被丢弃?各个图书馆的做法各不相同,但总体来说,这是一个相当随意的过程。没有人努力去追踪我们正在失去哪些书。在最近的一个案例中,英国前首相大卫·劳合·乔治私人藏书中的若干卷册被当作垃圾扔掉。有时,图书馆只需查看谷歌数字化了哪些书,就能决定丢弃哪些书。结果,我们文化遗产中相当大的一部分正遭受全面攻击。前几章我们指出,审查制度可能会出乎意料地支撑起一种理念。但在这里,情况恰恰相反:让书籍更广泛传播的努力正在威胁这些书籍的实体存续。图书数字化将留下复杂的遗产。
Which books go? Practices vary from library to library, but it’s generally a pretty indiscriminate process. There has been no effort to keep track of what we are losing. In one recent case, volumes from the personal library of the former British prime minister David Lloyd George were trashed. Occasionally, a library will decide which books to get rid of by just checking which books Google has digitized. The result is an all-out assault on a significant slice of our cultural heritage. A few chapters back we pointed out that censorship can unexpectedly prop up an idea. Here, the opposite is happening: An effort to make books more widely available is threatening the physical survival of those very books. Book digitization will leave a complex legacy.
报纸。当然,历史记录不仅仅包括书籍。例如,坡的气球骗局就出现在报纸上。历史报纸是一种宝贵的资源,反映了城市、运动和其他社会群体的日常关注。我们找到坡的气球骗局的电子版的可能性有多大?
Newspapers. Of course, the historical record consists of more than just books. Poe’s balloon hoax, for instance, appeared in a newspaper. Historical newspapers are an extraordinary resource, reflecting the day-to-day concerns of cities, movements, and other social groups. What are our chances of finding a digital edition of Poe’s balloon hoax?
乍一看,我们或许觉得机会很大。旧报纸的数字化进程已取得重大进展。如今,像《纽约时报》、《波士顿环球报》等许多主流报纸都已将其全部档案数字化。美国国家人文基金会资助了一项大规模的早期美国报纸数字化项目,涵盖了600万页内容,时间跨度超过一个世纪。其他国家也取得了进展。仅澳大利亚的Trove项目就已将大约1亿份报纸文章数字化。甚至谷歌也曾短暂加入,将2000份报纸的档案数字化。
At first glance, we might think the chances are pretty good. Digitization of old newspapers has made significant inroads. Today, major papers like the New York Times, the Boston Globe, and many others have digitized their full archives. The National Endowment for the Humanities has funded a large effort to digitize Early American newspapers, covering six million pages that span more than a century. Other nations have been making progress, too. Australia’s Trove project alone has digitized about one hundred million newspaper articles. Even Google briefly entered the fray, digitizing the archives of two thousand newspapers.
但尽管取得了这些令人印象深刻的进步,没有任何一项报纸数字化工作在规模和覆盖范围上能与谷歌对图书的数字化工作相媲美。
But despite these impressive strides, no newspaper digitization effort is comparable in scale and coverage to what Google is doing for books.
爱伦·坡的气球骗局就是这种差异的一个完美例证。如今很容易找到这个骗局的数字版。但这是因为书籍数字化的成功,而不是报纸数字化的成功。这个荒诞故事非常出名,甚至出现在许多坡作品选集中。这些选集以及坡的所有书籍都已被数字化。
Poe’s balloon hoax is a perfect example of this disparity. It is easy to find a digital edition of the hoax today. But that’s because of the success of book digitization, not newspaper digitization. The tall tale is so famous that it appears in many books that anthologize Poe’s work. These, along with all of Poe’s books, have been digitized.
但你找不到最初刊登该故事的报纸的电子版。国家人文基金会(NEH)只资助了1859年至1920年《纽约太阳报》的数字化工作。这篇发表于1844年的恶作剧文章,恰好落入了报纸数字化的众多盲区之一。坡的大部分报纸文章尚未数字化,而且无人知晓何时会实现数字化。
But you can’t find a digital copy of the newspaper that originally printed the story. The NEH has only funded digitization of the New York Sun from 1859 to 1920. The hoax, published in 1844, falls into one of the many vast blind spots of newspaper digitization. Most of Poe’s newspaper articles have not been digitized, and no one knows when they will be.
未出版的文本。出版本身是一项相对较新的发明。在印刷机出现之前,文本以手抄本的形式流传,由手写和抄写。如今,许多精彩的文本仅以这种形式流传至今。许多著名的手抄本,例如《死海古卷》,以及一些重要的藏品,例如大英图书馆的希腊手抄本,都已被数字化。但系统性地将手抄本数字化的努力,其规模相当有限。
Unpublished Text. Publishing itself is a relatively recent invention. Before the printing press, texts circulated as manuscripts, written and copied by hand. Today, a lot of wonderful texts survive only in this form. Many famous manuscripts, like the Dead Sea Scrolls, have been digitized, as have important collections, like the Greek manuscripts at the British Library. But systematic efforts to digitize manuscripts have been fairly local in scope.
当然,未出版文本的产生并没有随着出版技术的发明而停止。坡留下了422封信。他的这些信件已被数字化,但就像他的气球骗局一样,只是因为他名声显赫,才被收集成书。其他关于坡的资料也以坡为中心进行了数字化,例如德克萨斯大学奥斯汀分校的哈里·兰塞姆中心。在那里,你可以找到坡的一些原始手稿、写给他的信件以及他遗弃的作品的数字图像。你甚至可以看到一些埃德加·爱伦·坡的香烟卡——在棒球卡占领这个特殊的文化领域之前,印有演员、模特和作家的卡片也曾为烟草销售做出过贡献。
Of course, the production of unpublished texts didn’t stop with the invention of publication. Poe left 422 letters behind. In his case, the letters have been digitized, but as with his balloon hoax, only because he was so famous that they had been collected in books. Other material by and about Poe has been digitized in Poe-centric efforts, like one at the University of Texas at Austin’s Harry Ransom Center. There you can find digital images of some of Poe’s original manuscripts, letters that were written to him, and works that he abandoned. You can even see a few Edgar Allan Poe cigarette cards—before baseball cards took over this peculiar cultural niche, cards featuring actors, models, and authors did their part to help sell tobacco.
但就未出版的资料而言,坡的遗产并不十分具有代表性。像坡这样的人在历史记录中享受着一种明星待遇。任何与他们相关的资料往往都会被找出来并数字化。其他人呢?那 99% 的普通人的笔记、日记和信件通常埋藏在阁楼和旧箱子里,很难获取,直接对其进行数字化的努力更是罕见的例外。
But when it comes to unpublished material, Poe’s legacy isn’t very representative. Folks like Poe benefit from a kind of star treatment in the historical record. Anything related to them tends to be tracked down and digitized. What about everyone else? Buried in attics and old trunks, the notes, journals, and correspondence of the 99 percent are usually very hard to get at, and direct efforts to digitize them are the rare exception.
哈佛大学研究伊朗女性的阿夫萨内·纳吉马巴迪(Afsaneh Najmabadi)是少数成功发掘此类资料的案例之一。她在伊朗挨家挨户走访,询问每个家庭是否保存了与女性经历相关的历史文献。纳吉马巴迪将她发现的所有资料精心制作成数字图像。最终,这份名为“卡扎尔王朝伊朗女性世界”的档案可在www.qajarwomen.org网站免费查阅。它堪称一座宝库,囊括了从遗嘱到明信片再到婚约等各种资料。所有社群都拥有这样的宝藏。但时间正在慢慢地将它们消磨殆尽。令人遗憾的是,目前尚无系统性措施来阻止这一进程。
One of the few examples of a successful effort to unearth material of this sort was undertaken by Afsaneh Najmabadi, a Harvard faculty member who studies Iranian women. She went door to door in Iran, asking families if they had preserved any historical documents related to the experience of women. Najmabadi carefully created digital images of everything she found. The result, the Women’s Worlds in Qajar Iran archive, is freely available at www.qajarwomen.org. It is a treasure trove of everything from wills to postcards to marriage contracts. All communities have such treasures. But time is slowly leaching them away. Sadly, there is no systematic effort to stop that process.
实物。在弗吉尼亚州里士满,爱伦·坡故居附近,矗立着埃德加·爱伦·坡博物馆。在那里,你可以看到他的手杖、童年的床、一些旧衣服、他妻子的钢琴、养父的肖像,甚至还有他的一缕头发。这样的博物馆提醒我们,人类历史远非文字所能描述。历史也存在于我们绘制的地图和雕刻的雕塑中;存在于我们建造的房屋、耕耘的田地和穿着的服饰中;存在于我们吃的食物、演奏的音乐和信仰的神灵中;存在于我们绘制的洞穴和先于我们出现的生物化石中。
Physical Objects. Near Poe’s old home in Richmond, Virginia, stands the Edgar Allan Poe Museum, where you can see his walking stick, his boyhood bed, some of his old clothes, his wife’s piano, a portrait of his foster father, and even a lock of his hair. Such museums remind us that human history is much more than words can tell. History is also found in the maps we drew and the sculptures we crafted. It’s in the houses we built, the fields we kept, and the clothes we wore. It’s in the food we ate, the music we played, and the gods we believed in. It’s in the caves we painted and the fossils of the creatures that came before us.
不可避免地,这些资料中的大部分都会丢失:我们的创造力远远超过了我们保存记录的能力。但如今,能够保存下来的资料比以往任何时候都多。像 Europeana(欧洲数字图书馆)这样的项目致力于将来自欧洲各地博物馆、档案馆和资料库的数百万件文化文物以数字形式在网络上提供。艺术品可以以极高的分辨率拍摄,以二维甚至三维的形式呈现,这使得像 www.artsy.net 这样的网站能够帮助更多人欣赏世界上一些最重要的艺术作品。你真的喜欢那件新石器时代的陶器吗?如今,你可以对其进行三维扫描,然后用 3D 打印机打印出复制品。
Inevitably, most of this material will be lost: Our creativity far outstrips our record keeping. But today, more of it can be preserved than ever before. Projects like Europeana strive to make millions of cultural artifacts, drawn from museums, archives, and repositories all over Europe, available in digital form on the Web. Artworks can be photographed at an extraordinarily high resolution, in two or even three dimensions, enabling sites like www.artsy.net to help large numbers of people see some of the world’s most important works. Do you really like that piece of Neolithic pottery? Today you can scan it in three dimensions, and use a 3-D printer to print out a replica later.
在历史消失之前,我们能捕捉到多少?为了有所作为,我们需要胸怀大志。
How much of our history will we capture before it disappears? To make a difference, we need to think big.
我们已经生活在一个大科学时代。大型强子对撞机及其对希格斯玻色子的探索耗资90亿美元。人类基因组计划的目标是确定构成人类生命化学密码的字母序列,耗资30亿美元。我们用于理解人类历史的资金则少得多:美国国家人文基金会的年度预算总额约为1.5亿美元。
We already live in an era of big science. The Large Hadron Collider, and its quest for the Higgs boson, cost $9 billion. The Human Genome Project, whose goal was to determine the sequence of letters that spell out the chemical code underlying human life, cost $3 billion. The amount of money we put into understanding human history is far smaller: The entire annual budget of the National Endowment for the Humanities is about $150 million.
数字化历史记录的问题,为人文学科中类似大科学的工作带来了前所未有的机遇。如果我们能够证明数十亿美元的科学项目是合理的,那么我们也应该考虑一个数十亿美元项目的潜在影响,该项目旨在记录、保存和分享我们历史上最重要、最脆弱的部分,并让我们自己和我们的子孙后代都能广泛获取它们。通过合作,科学家、人文学者和工程师团队可以创造出拥有非凡力量的共享资源。这些努力很容易为未来的谷歌和脸书埋下种子。毕竟,这两家公司最初都致力于将我们社会的各个方面数字化。大人文学科正在等待着它的到来。
The problem of digitizing the historical record represents an unprecedented opportunity for big-science-style work in the humanities. If we can justify multibillion-dollar projects in the sciences, we should also consider the potential impact of a multibillion-dollar project aimed at recording, preserving, and sharing the most important and fragile tranches of our history to make them widely available for ourselves and our children. By working together, teams of scientists, humanists, and engineers can create shared resources of extraordinary power. These efforts could easily seed the Googles and Facebooks of tomorrow. After all, both these companies started as efforts to digitize aspects of our society. Big humanities is waiting to happen.
尽管还有大量工作要做,但历史记录的数字化已经取得了显著进展。我们刚才描述的那些资源,只需点击一下按钮即可获得,这正在改变我们对过去的理解,让我们能够经常与孩子们分享过去需要亲自前往卢浮宫或史密森尼博物馆才能看到的东西。这些资源将改变科学家和人文学者研究过去的方式,帮助我们观察和理解文字和艺术、头发和明信片、战争和浪漫是如何发展到今天的。
Still, despite the vast amount of work left to do, digitization of the historical record has already made significant progress. Having the kinds of resources that we just described available at the click of a button is transforming our appreciation of the past, making it possible to routinely share with our children things that once would have required a trip to the Louvre or the Smithsonian. These resources are going to transform how scientists and humanists approach the past by helping us observe and understand how writing and art, hair and postcards, warfare and romance got to where they are today.
埃德加·爱伦·坡开创了侦探小说这一体裁,其戏剧性引擎在于看似平凡的人背后却隐藏着最黑暗的秘密。假设你是一位历史侦探,想要探寻坡的秘密:他的内心世界,他最隐秘的想法。一个好的切入点是阅读他的私人信件。他留下的422封引人入胜的信件正等待着我们去探索。
Edgar Allan Poe invented the detective story, a genre whose dramatic engine is the fact that ordinary-seeming people can conceal the darkest of secrets. Suppose you were a historical sleuth who wanted to know Poe’s dark secrets: his inner life, his most guarded thoughts. A great place to start would be to look at his personal correspondence. The 422 fascinating letters he left us are just waiting to be explored.
但你知道谁是比坡记录更详尽的作家吗?你就是。如果你是普通的美国成年人,你每隔一周就会发送422封电子邮件。而你的账户里现在可能就存储着十年的电子邮件。这比坡留下的所有信件还要多数百倍。而且,拥有这份精彩档案的并非只有你:2010年,20亿电子邮件用户发送了10万亿封电子邮件,这还不包括垃圾邮件。如今,普通人的信件保存得比大多数已故总统的信件还要好。
But you know who is a much better-documented writer than Poe? You are. If you are the average American adult, you send 422 e-mails every other week. And you probably have a decade’s worth of e-mail living in your account right now. That’s hundreds of times as much material as all the correspondence that survives from Poe. And it’s not just you who has this fantastic archive: In 2010, two billion e-mailers sent ten trillion e-mails, excluding spam. Today, the average Joe’s correspondence is better preserved than the missives of most bygone presidents.
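The claim that a decade of e-mail amounts to "hundreds of times" Poe's surviving correspondence can be checked roughly. The sketch below counts messages rather than words and assumes 26 two-week periods per year; both are our simplifications, not figures from the text.

```python
# Rough comparison of a modern e-mail archive with Poe's surviving letters.
POE_LETTERS = 422            # letters Poe left behind, per the text
EMAILS_PER_FORTNIGHT = 422   # average American adult, per the text
FORTNIGHTS_PER_YEAR = 26     # assumption: 52 weeks / 2
YEARS_ARCHIVED = 10          # "a decade's worth of e-mail"

emails_per_year = EMAILS_PER_FORTNIGHT * FORTNIGHTS_PER_YEAR
decade_archive = emails_per_year * YEARS_ARCHIVED
ratio_to_poe = decade_archive / POE_LETTERS

# Global figures for 2010 quoted in the text (spam excluded).
GLOBAL_EMAILS = 10_000_000_000_000
GLOBAL_EMAILERS = 2_000_000_000
emails_per_emailer = GLOBAL_EMAILS / GLOBAL_EMAILERS

print(f"E-mails sent per year: {emails_per_year:,}")
print(f"E-mails in a decade:   {decade_archive:,}")
print(f"Ratio to Poe's letters: {ratio_to_poe:.0f}x")
print(f"Global average per e-mailer in 2010: {emails_per_emailer:,.0f}")
```

Under these assumptions, a decade of sent mail is 260 times the size of Poe's surviving correspondence by message count, and the 2010 global totals work out to 5,000 messages per e-mailer.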
这些电子邮件记录是宝贵的资源。它们不仅记录了我们过去的细节,还使我们能够以令人兴奋的新方式了解自己。以 JB 的电子邮件为例。对他的邮箱进行简单的 ngram 分析可以告诉你很多关于 JB 生活的信息。多年来,你可以看到他逐渐从法语转向英语,这反映了他从法国移居美国后的文化适应。友谊来来去去。年轻时的热情会消退:十年间,“party”一词的出现频率逐渐下降。与此同时,他的爱情生活展开,最终汇聚成一个 ngram:Ina。通过这种方式探索自己的 ngram,JB 一再重新发现那些曾经对他很重要、却慢慢被遗忘的东西。大数据不一定令人生畏。它可以成为一扇窥见我们自己生活的私密窗口,窥见我们量化的自我。
Those e-mail records are a powerful resource. Not only do they document the details of our past, but they also make it possible for us to learn about ourselves in exciting new ways. Take JB’s e-mail. A simple ngram analysis of his mailbox can tell you a great deal about JB’s life. Over the years, you can see the gradual shift away from French and toward English, reflecting acculturation to the United States after he moved from France. Friendships come and go. Youthful enthusiasms fade: party decreases in frequency over a decade. At the same time, his love life unfolds, converging on a final ngram: Ina. Exploring his ngrams in this way, JB repeatedly rediscovered things that had once been important to him, but were slowly forgotten. Big data doesn’t have to be daunting. It can be an intimate window into our own lives. Into our quantified selves.
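A "simple ngram analysis" of a mailbox might look something like the sketch below: for each year, count how often a single word (a 1-gram) appears as a share of all words written that year. This is a minimal illustration, not the authors' actual tooling; the function name and the toy mailbox are invented for the example.

```python
import re
from collections import Counter, defaultdict


def yearly_frequency(messages, word):
    """Return {year: share of that year's tokens that equal `word`}."""
    counts_by_year = defaultdict(Counter)
    for year, text in messages:
        # Lowercase and split into word tokens (1-grams).
        tokens = re.findall(r"\w+", text.lower())
        counts_by_year[year].update(tokens)

    target = word.lower()
    result = {}
    for year in sorted(counts_by_year):
        total = sum(counts_by_year[year].values())
        if total > 0:
            result[year] = counts_by_year[year][target] / total
    return result


# Invented toy mailbox of (year, message body) pairs; not real data.
toy_mailbox = [
    (2003, "party tonight and another party on friday"),
    (2003, "on se voit à la party"),
    (2008, "meeting moved to friday"),
    (2008, "party after the deadline"),
    (2012, "dinner with Ina on friday"),
]

for year, share in yearly_frequency(toy_mailbox, "party").items():
    print(year, round(share, 3))
```

Dividing by the year's total word count matters: it keeps a year with unusually heavy e-mail traffic from looking like a surge in any one word.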
我们的数字记忆远不止书信往来。除了一万五千封电子邮件外,平均每个人每年还会发送或接收五千个电子邮件附件。他们会“点赞”大约140个内容。他们会在Facebook上上传十八张照片,在Instagram上上传两张照片。他们会发九条推文。他们会在YouTube上上传二十秒的视频。他们会在Dropbox上上传五十二个文件。他们会在社交网络上与四十三个好友互动。然而,这些令人印象深刻的平均值还没有计算我们创作但未在线分享的所有图像、文档、视频和音乐。更没有考虑到全球近四分之三的人口仍然无法接入互联网。
Our digital memories extend far beyond correspondence. Along with fifteen thousand e-mails, the average person sends or receives five thousand e-mail attachments each year. They “like” about 140 things. They upload eighteen pictures to Facebook, and two more to Instagram. They tweet nine times. They put up twenty seconds of video on YouTube. They upload fifty-two files to Dropbox. They interact with forty-three friends on an online social network. And these impressive averages don’t account for all the images, documents, videos, and music that we create but don’t share online. And they don’t account for the fact that nearly three-quarters of the world’s population still lacks Internet access.
所有这些资料加在一起,构成了对数十亿人生活的惊人详尽记录——这样的记录在几十年前根本不存在。这在人类历史上史无前例。我们文明每小时发布的推文文字量,比古希腊现存所有文献的总和还要多。与今天的普通人相比,像坡这样的人简直是个谜。
Taken together, this material comprises an astonishingly detailed record of the lives of billions of people—a record that did not exist at all mere decades ago. It has no precedent in human history. Our civilization tweets more words every hour than can be found in all the surviving texts of ancient Greece. Compared to the average person today, a man like Poe is an enigma.
然而,与明天的人们相比,今天的人们完全是个谜。
Yet compared to the people of tomorrow, the people of today are a total mystery.
在本书的开篇,我们曾提到,如今普通人每年产生的数据量略低于1TB。但有些人的数据量却高于平均水平。其中一位就是住在波士顿的幼儿德韦恩·罗伊 (Dwayne Roy)。他通常一个周末就能产生这么多数据。
At the beginning of this book, we told you that the average person alive today produces a little less than one terabyte of data each year. But some people are above average. One of these people is Dwayne Roy, a toddler living in Boston. He regularly produces that much data in a single weekend.
为什么德韦恩能产生这么多的比特?德韦恩是麻省理工学院媒体实验室认知机器小组负责人德布·罗伊教授和东北大学研究言语病理学的鲁帕尔·帕特尔教授的儿子。两人都对儿童如何学习说话非常感兴趣。帕特尔关心这个问题,因为这正是她的学科所研究的内容。罗伊关心这个问题,因为他想用同样的原理教机器人用普通的人类语言交流。这对夫妇意识到,理解儿童如何习得语言的核心挑战之一是缺乏数据。没有人详细记录过儿童在成长过程中接触语言的所有方式。
Why does Dwayne produce so many bits? Dwayne is the son of Professor Deb Roy, who runs the Cognitive Machines Group at the MIT Media Lab, and Professor Rupal Patel, who studies speech pathology at Northeastern. Both are fascinated by how children learn to speak. She cares, because it’s exactly what her discipline is about. He cares, because he wants to use the same principles to teach robots how to communicate in ordinary human language. The couple realized that one of the central challenges in understanding how children acquire language is a lack of data. No one had documented, in detail, all the ways in which children are exposed to language as they grow up.
Patel 怀孕后,夫妻俩决定正面解决这个问题,全面记录新生儿出生后的头三年。Roy 获得了美国国家科学基金会(NSF)的资助,用于“人类语音组计划”(Human Speechome Project),在家中安装了 11 台高分辨率摄像机和 14 个麦克风。长达 3000 英尺的电缆将这些设备连接到位于地下室的数据中心。每天,这个地下室哨所都会存储关于德韦恩的信息超过三百千兆字节。他迈出的每一步,发出的每一个声音,听到的每一个声音,看到的每一个景象——所有这些都被记录下来,用于科学研究。(当婴儿睡着时,摄像头就会关闭,当他不在家时,显然无法追踪他。)
When Patel became pregnant, the pair decided to tackle this problem head-on by comprehensively recording the first three years of their new baby’s life. Funded by a grant from the National Science Foundation for what Roy called the Human Speechome Project, he outfitted their family’s home with eleven high-resolution video cameras and fourteen microphones. Three thousand feet of cable connect these devices to a data center that lives in their basement. Each day, this basement outpost stores more than three hundred gigabytes of information about Dwayne. Every step he takes, every noise he makes, every sound he hears, and every sight he sees—all of it is recorded for the benefit of science. (The cameras shut down when the baby is asleep, and obviously can’t track him when he’s out of the house.)
随着海量信息的涌入,地下室的数据中心很容易被淹没。正因如此,老罗伊不得不定期带着装满硬盘的箱子,将这些信息永久存档在他公司搭建的一套功能更强大的计算机系统中。为了追踪一个小男孩,他动用了一个价值数百万美元的CPU网格,并配备了一个能够存储PB(即百万GB)数据的巨型磁盘阵列。这套系统的名字也恰如其分地描述了它的工作方式:全面回忆。
With so much information pouring in, the basement data center tends to flood. That’s why the elder Roy has to regularly take suitcases full of hard disks to be permanently archived on a far more powerful computer system that he has constructed at work. To track one small boy, he uses a multimillion-dollar CPU grid outfitted with a massive disk array capable of storing a petabyte, or one million gigabytes. The system’s name doubles as its job description: TotalRecall.
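The storage figures in this passage hang together, as a quick calculation shows. The sketch below takes the text's numbers at face value (three hundred gigabytes per day, three years of recording, a one-petabyte array) and assumes decimal units (1 terabyte = 1,000 gigabytes) and recording on all 365 days of the year. Because the text says "more than" three hundred gigabytes, the results are lower bounds.

```python
# Rough storage arithmetic for the Human Speechome Project figures in the text.
GB_PER_DAY = 300             # "more than three hundred gigabytes" per day
GB_PER_TB = 1_000            # assumption: decimal units
GB_PER_PETABYTE = 1_000_000  # "a petabyte, or one million gigabytes"
RECORDING_YEARS = 3          # "the first three years" of the baby's life
DAYS_PER_YEAR = 365          # assumption: recording every day

weekend_gb = 2 * GB_PER_DAY
three_year_tb = GB_PER_DAY * DAYS_PER_YEAR * RECORDING_YEARS / GB_PER_TB
days_to_fill_petabyte = GB_PER_PETABYTE / GB_PER_DAY

print(f"One weekend: at least {weekend_gb} GB")
print(f"Three years: at least {three_year_tb:.1f} TB")
print(f"Days to fill a petabyte at this rate: {days_to_fill_petabyte:.0f}")
```

At this rate a single weekend yields at least 600 gigabytes, and three years of recording comes to roughly a third of a petabyte, which helps explain why a basement server is not enough.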
如今,德韦恩·罗伊是个例外。并非每个人的整个人生都会被人尝试以视频流的形式记录并保存下来。但随着数字媒体与人类生活日益深入的交融,这种记录将会变得司空见惯。
Today, Dwayne Roy is an exception. Not everyone is the subject of an attempt to record and preserve a video feed of his entire life. But as digital media and human life interpenetrate ever more deeply, this sort of record will become commonplace.
我们已经预见到将引领这一变革的设备类型。谷歌最近推出了谷歌眼镜 (Glass),这是一款安装在眼镜上的现实增强系统,它配备了一个网络摄像头,可以追踪视野内的一切,并配有一个小型显示器,可以实时提供你所见所闻的相关信息。烤蛋糕?这副眼镜或许能帮你找到食谱,并在你烘烤过程中显示操作说明。不认识刚走到你面前的那个人?没问题——谷歌眼镜可以通过人脸识别技术提醒你。当然,这副眼镜看起来有点傻。但你还记得在手机发展的早期,人们自言自语的样子有多傻吗?无论谷歌眼镜最终能否流行起来,这种技术都必将拥有光明的未来。
We can already see the types of devices that will usher in this transformation. Google recently introduced Glass, an eyeglass-mounted reality augmentation system that features a webcam tracking everything in your field of view and a small monitor to provide you with relevant information about what you’re seeing and doing in real time. Baking a cake? The glasses might figure that out, find the recipe, and show you instructions as you progress. Don’t recognize that guy who just walked up to you? No problem—using face recognition, Google Glass could remind you. Sure, the glasses look a bit silly. But do you remember how silly people looked, talking aloud to themselves, in the early days of the cell phone? Whether or not Google Glass ever takes off, this kind of technology is certain to have a bright future.
这样的设备让德韦恩·罗伊级别的生活记录变得轻而易举。起初,几乎没有人会对这种事感兴趣——这是对隐私的终极侵犯。但互联网从一开始就在重新定义隐私规范,诱导人们传播越来越多的个人信息,无论是在博客上记录我们的日常想法,还是宣布我们的感情状况。我们知道这个故事的结局:不可避免地,有些人会自愿开始记录他们的整个生活,而网站会应运而生,帮助他们传播这些信息。
Such devices make Dwayne Roy–grade life logging easy. At first, almost no one will be interested in doing that sort of thing—it is the ultimate breach of privacy. But the Internet has been redefining privacy norms from the beginning, inducing people to broadcast an ever-increasing amount of personal information, whether it be blogging our daily thoughts or announcing our relationship status. We know how this story ends: Inevitably, some people will voluntarily start to record their entire lives, and Web sites will pop up to help them with the distribution.
有一些显而易见的好处。有了生活日志,你永远不会彻底忘记任何事——你可以随时查阅你曾经有过的每一次感官体验。这可能是件好事。(有时如此。)它也可能让你更安全。毕竟,如果犯罪过程正在被直播,谁还会去伤害别人呢?你可以获得实时的生活指导,世界各地的人会不间断地建议你下一步该做什么。(转念一想,这很快就会变得烦人。)偶尔,你可能会暂停记录,为了享受亲密时光或去洗手间而关闭你的生活记录器。大多数人可能会这样做。有些人不会。
There are some obvious benefits. With a life log, you’d never irretrievably forget anything—you could just look up every sensory experience you’ve ever had. That can be a good thing. (Sometimes.) It might make you safer, too. After all, who would harm someone if the crime was being aired live? You could have real-time life coaching, with people around the world giving you nonstop advice about what to do next. (On second thought, that could rapidly become annoying.) Occasionally, you might go offlog, disabling your life logger for an intimate moment or a bathroom break. Most people will probably do this. Some will not.
生活记录既是观察我们所居住的世界的窗口,也将是观察我们身体的窗口。像 Nike+ FuelBand 和 Fitbit 这样的可穿戴电子设备已经能够全天记录你走了多少步、爬了多少楼梯以及燃烧了多少卡路里。一款名为 Scanadu Scout 的设备则更加雄心勃勃:Scout 是一个小型手持圆盘,可以在几秒钟内追踪并记录你的体温、心率和血氧水平。它还可以进行心电图检查,甚至分析你的尿液。基本上,Scout 是人类对《星际迷航》式三录仪的初稿。这些数据将确保生活日志同时充当医疗记录,其中充满了关于所有维持我们身体运转的无意识过程的细节。如果出现问题,日志会立即通知护理人员。如今每年看一次医生做体检的模式将被彻底颠覆。借助基于三录仪的远程医疗,医疗保健提供者将能够每天全天候追踪您的健康状况。如果出现任何异常,他们给您打电话的可能性与您给他们打电话的可能性一样大。
Life logging will be as much a window on our bodies as it is a window on the world we inhabit. Wearable electronics like the Nike+ FuelBand and the Fitbit already keep track of how many steps you’ve taken, how many stairs you’ve climbed, and how many calories you’ve burned, all day long. A gadget called the Scanadu Scout is more ambitious: A small, handheld disk, the Scout tracks and records your body temperature, heart rate, and blood oxygen levels, all in seconds. It can also perform an electrocardiogram and even analyze your urine. Basically, the Scout is humanity’s first draft of a Star Trek–style tricorder. Such data will ensure that life logs also serve as medical records, saturated with details about all the unconscious processes that keep our bodies going. If something goes wrong, the log will immediately notify caregivers. Today’s paradigm of visiting the doctor for an annual checkup will be turned on its head. Using tricorder-based telemedicine, health care providers will be able to track how you’re doing all day, every day. If something seems amiss, they’ll be just as likely to call you as you are to call them.
生活记录能让我们记录下发生在我们身上的、无论是体内还是体外的大量事件。但是,所有体验中最短暂的——人类的思想——又该如何记录呢?
Life logging will allow us to record a staggering fraction of what happens to us, both inside and outside our bodies. But what about the most evanescent of all experiences: human thought?
我们认为,科幻小说中那些能够在用户不情愿的情况下记录其每一个想法的读心装置,不太可能在短期内成为现实。问题在于,训练机器理解普通脑电波非常困难。但或许存在一种有效的变通办法。在过去十年左右的时间里,科学家们已经成功开发出心机接口,使瘫痪者能够利用意念的力量操控假肢,或通过无线方式发送意念指令来移动电脑鼠标。这类接口已被用于与那些按照普通医学定义看似处于昏迷状态的人进行交流。它们甚至正在进入玩具领域。
We think that the mind-reading gizmos of science fiction, capable of involuntarily transcribing a user’s every thought, are unlikely to become a reality anytime soon. The problem is that it is hard to train a machine to make sense of ordinary brain waves. But there may be a powerful work-around. In the last decade or so, scientists have been successfully developing mind-machine interfaces that enable paralyzed individuals to move a prosthetic limb with the power of thought, or to wirelessly broadcast a mental command that moves a computer mouse. Such interfaces have been used to communicate with people who appear, by the ordinary medical definition, to be comatose. They’re even making their way into toys.
这些接口依赖于这样一个事实:尽管普通的脑电波会让机械窃听者感到困惑,但我们可以训练大脑,让脑电波的活动对机器更加透明。这是通过主动生成机器能够识别的特定神经信号来实现的。在每一个这样的接口中——无论是追踪脑血流的功能磁共振扫描仪、追踪脑电活动的脑电图,还是连接到一小群脑细胞的神经植入物——所有机器所做的就是寻找一个约定的信号,并以预先编程的方式做出响应。这种方法已经取得了巨大的成功。不难想象,这样的系统将使我们能够用意念来操作电器,甚至相互发送信息。而这或许仅仅是个开始。
These interfaces rely on the fact that, although ordinary brain waves are confusing to a mechanical eavesdropper, we can train our brains to make their activity more transparent to a machine. This is accomplished by voluntarily generating specific neural cues that a machine is capable of recognizing. In every such interface—whether it’s an fMRI scanner tracking blood flow in the brain, an electroencephalogram tracking electrical activity, or a neural implant linked to a small cluster of brain cells—all the machine does is look for an agreed-upon signal and respond to it in a preprogrammed way. This approach has been enormously successful. It’s not hard to imagine such systems allowing all of us to use our minds to operate appliances, or even send messages to one another. And that may be just the start.
当我们思考时,我们的思绪常常以一系列词语的形式出现。我们用一个特殊的短语来描述这种现象:意识流。在某种程度上,意识流的存在令人惊讶。语言是与他人沟通的系统。当没有其他人参与时,我们为什么也会用语言来组织内心的想法,这一点并不明显。但我们都这样做。
When we think, our cogitation frequently takes the form of a sequence of words. There’s a special phrase we use to describe this phenomenon: the stream of consciousness. On some level, the existence of the stream of consciousness is surprising. Words are a system for communicating with other people. It’s not obvious why we also use them to organize our internal thoughts when no other person is involved. But we all do.
从大脑的角度来看,脑机接口的神经提示与口语并无太大区别。它们只是脑细胞以某种模式进行激活而已。主要区别在于,我们不是用这些神经词与人对话,而是用它来与机器对话。人们或许会习惯于用相应的“心灵词”来陪伴内心独白,从而为作为“听众”的机器创建一个实时的隐藏字幕系统,这并非异想天开。通过这种方式与计算机合作,我们或许可以记录下自己的内心独白。
From the brain’s point of view, a neural cue to a mind-machine interface is not so different from a spoken word. It’s all just brain cells firing in patterns. The main difference is that, instead of using this neural word to talk to a person, we use it to talk to a machine. It’s not crazy to think that people might get used to accompanying their internal monologue with the corresponding mindwords, creating a real-time closed-captioning system for the benefit of the machines in their audience. By cooperating with computers in this way, it might be possible to log our inner monologue.
每一次感官体验,每一次心跳,每一次胃里的咕噜声,甚至每一个闪过脑海的想法——原则上,所有这些都是可以记录的。实际上,记录它们将以我们今天难以想象的惊人方式改变我们的生活。而且这些记录不仅会改变我们自己的生活。如果我们选择这样做,我们的生活记录将比我们自己更长久。我们将能够为孩子和亲人留下我们存在的完整记录。它们将能够从我们的成功和遗憾、智慧和愚昧中汲取教训:数字化的来世。如果你愿意,你可以将你的人生日志出售给公司,或者与科学家和学者分享。在未来的图书馆里,传记部分将不仅仅收录人们的人生故事,还将收录完整的广播内容。
Every sensory experience, every beat of our heart, every rumble in our stomach, and even every thought that crosses our mind—all these are in principle loggable. Actually logging them will change our lives in breathtaking ways that we can hardly imagine today. And these logs won’t just change our own lives. If we so choose, our life logs will outlive us. We will be able to leave a complete chronicle of our existence to children and loved ones. They will be able to learn from our triumphs and our regrets, our wisdom and our foolishness: a digital afterlife. If you were so inclined, you could sell your life log to a company, or share it with scientists and scholars. In the library of the future, the biography section won’t just have the stories of people’s lives. It will have the complete broadcast.
2013年4月15日,两枚炸弹在波士顿马拉松终点两百码处爆炸。弹片撕裂了聚集在终点线的庞大人群。三名观众丧生,数百人受伤,至少十四名受害者需要截肢。赛后几天,联邦调查局(FBI)急切地寻找线索,但证据寥寥。炸弹由高压锅制成,藏在背包里,里面装满了钉子、滚珠轴承和金属碎片。所有这些物品任何人都能轻易获得。五十万观众观看了比赛。究竟是谁安放了炸弹?这成了一场规模空前的悬疑侦查。
On April 15, 2013, two bombs exploded two hundred yards from the end of the Boston Marathon. Shrapnel tore through the massive crowds that had assembled at the finish line. Three spectators were killed. Hundreds were wounded. At least fourteen victims required amputations. In the days following the event, the FBI was desperate for clues, but there was little evidence. The bombs had been constructed from pressure cookers, hidden in backpacks, loaded with nails, ball bearings, and scraps of metal. All of these items can easily be obtained by anyone. Half a million spectators had watched the race. Which of them had planted the bombs? It was a whodunit at the largest scale imaginable.
但FBI却另有绝招:数字历史。FBI意识到,犯罪现场人山人海的优势在于:旁观者会拍照。街道两旁的商店也都配备了摄像头。如此狭小的空间里,这么多摄像头,如此短的时间内拍摄了如此多的照片,肯定有人能拍到罪犯背着背包的清晰照片。
But the FBI had a powerful trick up its sleeve: digital history. The Bureau recognized that, in one respect, the massive number of people at the scene of the crime was an advantage. Spectators take photographs. The stores that lined the street had their own cameras, too. With so many cameras in such a small space, and so many pictures being snapped in such a short period of time, surely someone would have taken a good photo of the culprit holding the backpack.
他们的预感果然不虚此行。几天之内,调查人员就公布了一段罗德与泰勒百货公司(Lord & Taylor)的监控录像,录像中清晰可见两名爆炸嫌疑人。线索纷至沓来,其中很多都是高清照片,巧合的是,这些照片捕捉到了嫌疑人的面部特征。随着照片迅速在网络上传播,爆炸嫌疑人最终展开了血腥的屠杀。其中一人在与警方的枪战中身亡,另一人落网。但他们接下来的爆炸计划——他们原本计划袭击纽约时代广场——却落空了。警告那些不法分子:无论你是谁,无论你身在何处,大数据都能追踪到你。
That hunch was right, and within days, the investigators released images from a Lord & Taylor surveillance video in which the bombers—two, it turned out—could clearly be seen. Tips started streaming in, many in the form of high-resolution photos that had, by sheer coincidence, captured the suspects’ faces. With their pictures spreading quickly across the Web, the bombers went on a final, bloody rampage. One was killed in a shoot-out with police. The other was caught. But their plans for additional bombings—they had intended to attack New York’s Times Square next—came to nothing. Bad guys be warned: Whoever you are, wherever you are, big data can track you down.
但数字化历史的作用不仅仅在于追捕坏人,它还会伤害无辜者。
But digitized history does more than hound the bad guys. It can also hurt the innocent.
2011 年 11 月,15 岁的蕾塔·帕森斯(Rehtaeh Parsons)参加派对时,据称被四名男孩强奸。男孩们拍了照片。这些照片开始通过电子邮件和脸书传播。帕森斯的同龄人非但没有支持她,反而让她的生活变成了一场噩梦。面对持续不断的欺凌,她转学了。她的家人也搬了家。她一度住院数周。但她无法逃避羞辱。她无法逃避线上线下的欺凌。她无法逃避那些永远无法消失的数码照片。2013年4月,帕森斯上吊自杀。
In November 2011, fifteen-year-old Rehtaeh Parsons went to a party, where she was allegedly raped by four boys. The boys took pictures. The pictures began to spread over e-mail and on Facebook. Instead of rallying around her, Parsons’ peers made her life into a nightmare. Faced with constant bullying, she changed schools. Her family moved. She was hospitalized for weeks at a time. But there was no escaping the shame. There was no escaping the bullying, both online and off. There was no escaping those digital pictures that would never go away. In April 2013, Parsons hanged herself.
摄影从诞生之日起就一直受到一种有点奇怪的迷信的困扰:通过记录你的图像,相机窃取你灵魂的一小部分。这个想法很有道理。正如我们刚才看到的,仅仅拥有一个人的一张照片就能赋予你某种控制权。大数据会直接窃取你的灵魂吗?
Photography, from its inception, has been dogged by a somewhat peculiar superstition: that by recording your image, a camera steals a tiny part of your soul. There is something to that idea. As we just saw, having just a single picture of someone can give you a form of power over that person. Will big data steal your soul outright?
这是一个亟待解决的问题。过去,为了给后世留下一些东西,需要付出刻意的努力,因此记录下来的资料非常少。但我们已经取得了长足的进步,不再只是将数据刻在岩石上。很快,追踪我们经历的很多事情将变得轻而易举,以至于我们中的许多人会发现,默认记录所有事情会更简单。将某些内容隐藏起来需要深思熟虑的选择。因此,保存信息正从一个技术难题变成一个道德困境。而这个困境最终取决于几个关键问题:哪些内容应该被记录在案?如果有记录,谁有权访问它?
This is an urgent question. Because it used to take a deliberate effort to preserve something for posterity, very little was recorded. But we’ve come a long way from carving our data on a rock. Soon it will become so easy to track much of what we experience that many of us will find it simpler to just record everything by default. It will take a deliberate choice to keep something off the record. As a result, preserving information is changing from a technological puzzle into a moral dilemma. And the dilemma turns on a small handful of issues. What are the things that belong offlog? And if there is a log, who has the right to access it?
很难预测这些问题将如何解答,因为推测科技的未来远比推测价值观的未来容易得多。以德韦恩·罗伊为例。即使动机是为了推动科学发展,一个两岁男孩的隐私真的比美国总统还少吗?很多人会反对以这种方式被记录下来。但社交网络正在以惊人的速度改变着社会分享的规范。我们今天在网上分享的很多东西,在二十年前,甚至五年前,都会受到严密保护。也许德韦恩这一代的孩子不会介意。也许他们都会认为,没有成长时期的生活日志是极其原始的行为。
It’s hard to tell how these questions will be answered, because it is much easier to speculate about the future of our technology than about the future of our values. Take Dwayne Roy’s case. Even if the motivation is to advance science, is it really right that a two-year-old boy has less privacy than the president of the United States? Many people would object to being documented in that way. But the social web is transforming communal norms about sharing at a shocking pace. Lots of the things that we share online today would have been closely guarded twenty years ago, or even five years ago. Perhaps the kids of Dwayne’s generation won’t mind. Perhaps they will all think it hopelessly primitive not to have a life log of one’s formative years.
尽管如此,你可以说我们老派,但正如生活记录即将成为现实一样,我们也同样清楚,公开生活记录是一个非常危险的概念。当然,营销人员会利用它们继续向我们推送烦人的广告。连锁店塔吉特(Target)已经可以利用其数据分析来判断哪些顾客怀孕了。有一次,塔吉特的优惠券将一名少女怀孕的消息泄露给了她毫无戒心的父母。如果营销人员和跨国公司能够不受监管地访问生活日志,情况会变得多么令人不快,这可想而知。
Still, call us old-fashioned, but just as it seems apparent that life logging will become possible, it seems equally apparent to us that public life logs are a very dangerous concept. Marketers, of course, will use them to continue flooding us with annoying advertisements. Already, the chain store Target can use its data analytics to figure out which of its customers is pregnant. On one occasion, Target coupons broke the news of a teenager’s pregnancy to her unsuspecting parents. One can only imagine how unpleasant this would get if marketers and global corporations had unregulated access to life logs.
然而,企业干预或许并非我们最担心的问题。政府可以利用生活记录随时追踪所有公民。谷歌和脸书等公司已经在国家安全受到威胁时向联邦政府公开其记录。有时,无论公司是否愿意,政府都能设法获取这些记录。2012年9月,纽约刑事法庭迫使推特交出“占领华尔街”抗议者之一马尔科姆·哈里斯的私人推文。2013年,爱德华·斯诺登泄密事件引发全国公愤,促使奥巴马总统向美国人民保证“没有人在监听你们的电话”。合法的公共利益和“老大哥”之间的界限在哪里?它必须存在。在一个政府可以随时调取任何人生活记录的世界里,抵抗真的是徒劳的。
Yet corporate interference may not be the worst of our concerns. A government could use life logging to track all citizens, all the time. Already, companies like Google and Facebook open their records to the federal government when national security is at stake. Sometimes, the government manages to get at the records whether the company likes it or not. In September 2012, Twitter was forced by a New York criminal court to hand over the private tweets of Malcolm Harris, one of the Occupy Wall Street protesters. In 2013, the Edward Snowden leaks unleashed national outrage, prompting President Obama to reassure Americans that “nobody is listening to your telephone calls.” Where is the line between legitimate public interest and Big Brother? It must exist. In a world where the government can subpoena anyone’s life log, anytime, resistance really is futile.
更糟糕的是,如果思维记录技术真的可行,人们可以想象出种种反乌托邦世界。比如说:一个极权政府可能会强迫每个人无时无刻地记录下所有想法。在思维记录中留下空白会受到惩罚,私人想法将成为过去。这甚至还不是最可怕的场景。想象一下,如果政府强制执行思维记录,要求公民一遍又一遍地记录特定的想法,就像小学生背诵效忠誓词或教义问答一样。被困在强制性的意识流中,公民将成为自己思想的囚徒。
Worse still are the dystopias one can imagine if mind logging ever becomes technically feasible. Here’s one: A totalitarian government might force everyone to log every thought, all the time. Blank entries in the mind log would be punished, and private thoughts would become a thing of the past. That’s not even the most terrifying scenario. Imagine if a government enforced a mandatory mind log, requiring citizens to transcribe specific thoughts, over and over, the way schoolchildren might recite the Pledge of Allegiance or a catechism. Trapped in a compulsory stream of consciousness, citizens would become prisoners of their own minds.
这些都是令人担忧的问题。尽管生活记录技术尚处于萌芽阶段,但人们已经看到了一场反抗运动的苗头。在西雅图,5 Point Café 的老板们担心,生活记录技术的出现会阻碍顾客们参与他们惯常的、随心所欲的恶作剧。缺少恶作剧显然对生意不利,所以这家酒吧禁止了谷歌眼镜的使用。一家名为 Snapchat 的网络初创公司提供一项服务,允许用户发送消息,这些消息会在指定时间后删除。随着生活记录的日益普及,这将催生对离线空间、离线时间和离线互动的需求。
These are huge concerns. But even though life logging is still only a nascent possibility, one can already begin to see the seeds of a countermovement. In Seattle, the owners of the 5 Point Café worry that the presence of life-logging technology will discourage customers from engaging in their typical, freewheeling shenanigans. An absence of shenanigans would obviously be bad for business, so the bar has banned Google Glass. A Web start-up called Snapchat offers a service allowing users to send messages that are deleted after a specified length of time. As life logging becomes increasingly common, it will create the need for offlog spaces, offlog times, and offlog interactions.
我们的生活投下数字阴影。争夺这些巨大阴影的斗争,争夺拥有我们个人历史的权利以及控制谁有权访问这些历史的权利,已经打响。数字公共领域会发展成为一个广阔而奇妙的游乐场吗?会成为执法的有力工具吗?会成为无数代人的经验和道德遗产吗?还是会成为监控国家的支柱?这场较量将成为下个世纪最重大的道德冲突之一。
Our lives cast digital shadows. The battle for those big shadows, the right to own our personal history and to control who has access to it, is already met. Will the digital commons grow up to be a vast and wondrous playground? A powerful tool for law enforcement? The experiential and moral legacy of countless generations? Or the backbone of a surveillance state? This contest will be one of the great moral conflicts of the coming century.
伽利略的望远镜(两个背对背的镜头)标志着我们文明史的一个转折点。他所看到的景象与天主教教义相悖。为此,宗教裁判所将他软禁,他在软禁中度过了余生。但教会无法囚禁他的思想。在伽利略之后(而且在很大程度上正是因为他),教会对西方思想的长期统治开始衰落。
Galileo’s telescope—two lenses, back to back—marked a turning point in the history of our civilization. What he saw contradicted Catholic Church doctrine. For his trouble, the Inquisition put him under house arrest, where he remained for the rest of his life. But the Church could not arrest his ideas. After Galileo—and in no small part because of him—the Church’s lengthy dominion over the Western mind began to ebb.
取而代之的是两大思想传统。其一是自然科学,其使命是通过经验观察来确定宇宙的本质。其二是人文学科,其使命是通过审慎的批判性分析来研究人性。这两大学科共同为西方文明贡献了许多强大的力量,从自由民主到工程技术。
In its place, two great intellectual traditions took root. One was the sciences, tasked with determining the nature of the universe by means of empirical observation. The other was the humanities, the study of human nature through careful, critical analysis. Together, these two siblings have given many powerful gifts to Western civilization, from freedom and democracy to engineering and technology.
然而,这些强大的“兄弟”早已疏远。即使在今天,一个典型的学生也必须在科学和人文之间做出选择;很少有专业或学位课程能够横跨两者。一个典型的研究人员也必须与其中一个群体结盟。这些界限早已被刻画在我们的学校、大学以及整个知识生态系统中。我们学习数学。我们学习莎士比亚。但我们从未一起学习过。
Yet these mighty brethren have long been estranged. Even today a typical student must choose to focus on either the sciences or the humanities; rare is the major or degree program that spans the two. A typical researcher, too, must ally with one group or the other. The boundaries have long been encoded into our schools, our universities, and our entire knowledge ecosystem. We study math. We study Shakespeare. But not together.
至少直到最近才如此。在斯坦福大学,一位名叫弗朗哥·莫雷蒂的意大利学者开始利用电子书的浪潮,研究莎士比亚作品中人物的互动网络,将计算机科学和统计物理学的方法应用于一个全新的领域。内布拉斯加大学文学教授马修·乔克斯能够根据一些看似深奥的东西,比如小说中代词的统计分布,来识别十九世纪小说之间的微妙关系。在美国国家人文基金会,布雷特·鲍勃利领导着一个名为“挖掘数据挑战”的创新项目,该项目帮助全美各地的人文学者批判性地思考所有这些新数据能为他们带来什么。他们正在探索数学从未涉足的领域。
At least not until recently. At Stanford, an Italian scholar named Franco Moretti has started using the onslaught of digital books to study the interaction network of characters in Shakespeare, applying methods and approaches from computer science and statistical physics in a radically new domain. Matthew Jockers, a literature professor at the University of Nebraska, is able to identify subtle relationships between nineteenth-century novels based on things as seemingly esoteric as the statistical distribution of the pronouns that they contain. At the National Endowment for the Humanities, Brett Bobley heads an innovative program called the Digging into Data Challenge, which helps humanists all over the United States think critically about what all this new data can do for them. They are going where no math has gone before.
达特茅斯学院的情况则不同,那里的数学家丹尼尔·洛克莫尔 (Daniel Rockmore) 一直在利用电子书研究不同作者的风格如何相互影响。他运用的数学知识比莫雷蒂多得多,阅读量却少得多。但两人志趣相投。德克萨斯大学奥斯汀分校的心理学家詹姆斯·彭尼贝克 (James Pennebaker) 也在研究文本中代词的分布如何反映作者的情绪。彭尼贝克和乔克斯的思想传统截然不同,但他们也是志趣相投的人。白宫科技政策办公室的汤姆·卡利尔 (Tom Kalil) 则受奥巴马总统亲自委托,牵头开展一项大数据计划。卡利尔和鲍勃利的资助对象并非同一批人,但他们也是志趣相投的人。
Except at Dartmouth, where a mathematician named Daniel Rockmore has been using digital books to study how authors’ styles influence one another. He uses much more math than Moretti, and much less reading. But the two are kindred spirits. Or at the University of Texas at Austin, where psychologist James Pennebaker has been studying how the distribution of pronouns in a text reflects the mood of the author. Pennebaker and Jockers come from completely different intellectual traditions, but they, too, are kindred spirits. Or at the White House’s Office of Science and Technology Policy, where Tom Kalil is spearheading a big data initiative at the behest of President Obama himself. Kalil and Bobley don’t fund the same people. But they are kindred spirits as well.
随着历史记录性质的变化,它正在模糊科学与人文学科之间的界限。由此产生的混杂学科有着许多名称。从事此类研究的历史学家倾向于称自己为“数字人文学者”。语言学系有“语料库语言学家”。心理学家和社会学家有时更喜欢使用“计算社会科学家”一词。在一家又一家硅谷初创企业中,这种酝酿已久的概念性变革已是家常便饭。
As the nature of the historical record changes, it is scrambling the boundaries between science and the humanities. The resulting mishmash goes by many names. Historians who do this sort of thing are apt to call themselves “digital humanists.” Linguistics departments have “corpus linguists.” Psychologists and sociologists sometimes prefer the term “computational social scientist.” And in one Silicon Valley start-up after another, this simmering conceptual chulent is just business as usual.
渐渐地,来自这一深刻裂痕各方的思想正在汇聚在一起。2013年春,在马里兰州举行的一次学术会议上,美国国立卫生研究院、美国国家人文基金会和美国国家医学图书馆召集了一群研究人员,他们涵盖了从艺术史到非洲语言到计算机科学,从微生物学到修辞学到诗学再到动物学等众多学科。制药巨头葛兰素史克公司(GlaxoSmithKline)前高级副总裁大卫·瑟尔斯发表了主题演讲。这是美国国立卫生研究院(NIH)和美国国家人文基金会(NEH)首次联手主办会议。会议主题“数据、生物医学和数字人文”展现出一种惊人的乐观:历史学家、哲学家、艺术家、医生和生物学家共同思考数据,能够比任何一方单独行动更好地推进各自的事业。会议名称“共享视野”恰如其分。在我们所有工作的交汇处,蕴藏着我们知识未来中最激动人心的领域。
Slowly, minds from all sides of this deep rift are coming together. At an academic conference in Maryland during the spring of 2013, the National Institutes of Health, the National Endowment for the Humanities, and the National Library of Medicine convened a group of researchers spanning an astonishing range of disciplines, from art history to African languages to computer science, from microbiology to rhetoric to poetics to zoology. David Searls, former senior vice president at pharmaceutical giant GlaxoSmithKline, gave the keynote address. It was the first time that the NIH and the NEH had ever gotten together to sponsor a conference. The topic, “Data, Biomedicine, and the Digital Humanities,” betrays an astonishing optimism: the idea that historians and philosophers and artists and doctors and biologists, thinking about data together, can advance their individual causes better than any of them can alone. The conference title, “Shared Horizons,” was dead-on. At the interface of all our work lies the most exciting terrain in our intellectual future.
没人知道该如何称呼它,也没人知道它的未来走向。但有一件事是肯定的:科学与人文学科正在再次成为精神上的契合。正如伽利略在17世纪改变了我们对世界的理解一样,这两个镜头,背靠背,也将在21世纪产生同样的效果。
No one knows quite what to call it. And no one knows quite where it’s going. But one thing is certain: Science and the humanities are becoming, once again, kindred spirits. And just as Galileo transformed our understanding of our world in the seventeenth century, these two lenses, back to back, will do the same in the twenty-first.
盖尔·多尼克(Gaal Dornick)使用非数学概念,将心理史学定义为数学的一个分支,它研究人类群体对固定的社会和经济刺激的反应......
Gaal Dornick, using nonmathematical concepts, has defined psychohistory to be that branch of mathematics which deals with the reactions of human conglomerates to fixed social and economic stimuli. . . .
所有这些定义都隐含着这样的假设:所处理的人类群体足够大,可以进行有效的统计处理……另一个必要的假设是,人类群体本身不知道心理史学分析,因此它的反应确实是随机的……
Implicit in all these definitions is the assumption that the human conglomerate being dealt with is sufficiently large for valid statistical treatment. . . . A further necessary assumption is that the human conglomerate be itself unaware of psychohistoric analysis in order that its reactions be truly random. . . .
—艾萨克·阿西莫夫,《基地》
—Isaac Asimov, Foundation
在科幻小说中最著名的作品之一《基地》中,艾萨克·阿西莫夫想象了一位名叫哈里·谢顿的数学家。谢顿的伟大贡献是,他通过将复杂的数学理论与对特定时刻社会状况的详细测量相结合,找到了预测未来的方法。当然,谢顿无法知道某个人会做什么:个体行为的随机性太强。但他可以推断出整个社会将会做什么。例如,谢顿推断出统治银河系一千多年的帝国即将覆灭。谢顿的理论并没有告诉他究竟是谁会做什么导致帝国的覆灭,但它确实告诉他帝国的覆灭迫在眉睫,并且会留下一片混乱。
In one of the most famous books in all of science fiction, Foundation, Isaac Asimov imagines a mathematician named Hari Seldon. Seldon’s great contribution is that he figures out how to predict the future by combining elaborate mathematical theories with detailed measurements about the state of society at any given moment in time. Of course, Seldon can’t know what a particular person will do: Individual people are too random. But he can figure out what society as a whole will do. For instance, Seldon figures out that the Empire, which has ruled the galaxy for more than a millennium, will soon fall. Seldon’s theory doesn’t tell him exactly who will do exactly what to bring about the fall, but it does tell him that the fall is imminent, and that it will leave chaos in its wake.
这类关于聚集行为的理论在科学界并不少见。想象一下,当你给气球充气,然后不打结就松手时会发生什么。小孩子知道,空气会从开口处流出,气球放气时会飞走,最终落到地面。物理学家可以做得更好,计算出空气分子从洞中逸出的速率、放气的速度以及气球在空中飞驰的速度。但世界上没有一位科学家能够告诉你气球中各个气体分子会以什么样的顺序飞出:单个分子的随机性太强了。气球及其所含的空气遵循着可预测的模式,但只有当它们作为一个整体来考虑时才会如此。
Such theories of aggregate behavior are not uncommon in the sciences. Consider what happens when you inflate a balloon and then, without tying the knot, let go. A small child learns that air will start flowing out of the opening and that, as the balloon deflates, it will fly away, eventually falling to the ground. A physicist could do better, calculating the rate at which air molecules spill out of the hole, the pace of deflation, and the speed of the balloon as it whizzes through the air. But no scientist in the world can tell you in what order the individual gas molecules in the balloon will hurtle out: Single molecules are far too random. The balloon, along with the air it contains, follows a predictable pattern, but only when considered in aggregate.
阿西莫夫的想法——他称之为心理史学——是这种方法可能使我们从总体上预测人类文明的未来。
Asimov’s idea—which he dubbed psychohistory—was that such an approach might make it possible to predict the future, in aggregate, of human civilization.
对于当代社会科学家来说,这种充满热情的文化决定论或许听起来完全陌生。大多数领域——经济学是个显著的例外——都不太相信这一概念。这多少有些令人惊讶,因为阿西莫夫的概念实际上是社会科学的元老学说。19世纪初,社会学之父、社会科学的奠基人奥古斯特·孔德相信,细致的实证研究最终将揭示支配人类社会运作的规律,就像仔细研究物理现象揭示了背后的数学原理一样。他最初将这门后来被称为社会学的学科命名为“社会物理学”。孔德认为,理解社会学规律将使我们能够利用它们创造一个更美好的社会,就像理解物理学可以用来制造更好的烤面包机一样。当阿西莫夫笔下的哈里·谢顿基于心理史学计算采取行动以尽量减少银河系混乱时,他正是孔德幻想的虚构化身。
To a contemporary social scientist, this enthusiastic brand of cultural determinism may seem utterly foreign. It’s a notion that most fields—economics is a notable exception—give little credence. That’s a bit surprising, because Asimov’s concept is actually the ur-doctrine of social science. In the early nineteenth century, Auguste Comte, the father of sociology and the founder of the social sciences, believed that careful empirical study would eventually reveal the laws that governed the operation of human society, in the same way that careful study of physical phenomena had revealed underlying mathematical principles. His original name for the discipline that he later dubbed sociology was social physics. Comte believed that understanding the laws of sociology would make it possible to use them to create a better society, much in the way that an understanding of physics can be used to build a better toaster. When Asimov’s Hari Seldon, on the basis of psychohistorical calculations, takes actions to minimize galactic chaos, he is the fictional embodiment of Comte’s fantasy.
当我们想到即将席卷社会科学的数据浪潮时,我们很容易想象,有了如此多的数据,孔德的梦想或许就能实现。
It is very tempting, when thinking about the tidal wave of data that will soon break over the social sciences, to imagine that, with so much data, Comte’s dream might be within reach.
另一方面,试图在历史趋势发生之前进行预测似乎完全是疯狂的。
On the other hand, attempting to predict historical trends before they happen seems completely nuts.
因此,我们决定利用 ngram 做最后一个实验,目的是检验历史趋势是否可预测。我们测试了一种最简单的预测,我们称之为文化惯性。我们所说的文化惯性是指,正在上升的 ngram 倾向于持续上升,而正在下降的 ngram 倾向于持续下降。股市并不表现出惯性:如果股市表现出惯性,那么任何人都可以作为投资者大赚一笔。如果人类文化表现出惯性,那么我们可以通过研究 ngram 刚刚做的事情来了解它下一步会做什么。
So, using ngrams, we decided to do one last experiment, whose goal was to check whether historical trends might be predictable. We tested the simplest possible prediction, something we call cultural inertia. All we mean by cultural inertia is that ngrams that are going up will tend to keep going up, and ngrams that are going down will tend to keep going down. The stock market doesn’t exhibit inertia: If it did, anyone could make a killing as an investor. If human culture exhibits inertia, then we can learn a lot about what an ngram will do next by examining what it just did.
这是机器人绘制的图表:
Here is the chart that the robot drew:
浅灰色部分显示的是大量 ngram 的平均频率,之所以选中这些 ngram,是因为它们在二十年间呈现持续下降趋势。二十年结束后,这种趋势还会持续吗?的确如此,而且会持续几十年。深灰色部分显示的是相反的情况,即二十年间持续上升的 ngram 集合。它们的急剧上升趋势持续了近一个世纪——只要我们能够测量的话。所以,你看:上升的 Ngram 往往会继续上升。下降的 Ngram 往往会继续下降。更普遍地说:运动中的 Ngram 往往会保持运动状态(除非受到心理历史力量的作用)。
In light gray, we show the average frequency of a large number of ngrams that were chosen for inclusion because they show a consistent decline over a twenty-year period. Does the trend continue when the period has ended? It does, for decades thereafter. In dark gray, we examine the reverse, a collection of ngrams that increased consistently over a twenty-year period. Their dramatic ascent continues for nearly a century—for as long as we can measure. So there you have it: Ngrams that are going up tend to keep going up. Ngrams that are going down tend to keep going down. More generally: Ngrams in motion tend to stay in motion (unless acted on by a psychohistorical force).
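The selection-and-follow-up procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' actual analysis: the text does not say how a "consistent" rise or decline was defined, so the code assumes a strictly monotonic change in every year of the window, and the function names (`is_consistent`, `select_ngrams`, `average_trajectory`, `inertia_holds`) and the toy frequencies are hypothetical.

```python
from statistics import mean

def is_consistent(series, start, length, direction):
    """True if the series moves strictly in `direction` (+1 up, -1 down)
    at every step of the window. Assumption: 'consistent' means monotonic."""
    window = series[start:start + length]
    return len(window) == length and all(
        (b - a) * direction > 0 for a, b in zip(window, window[1:])
    )

def select_ngrams(freqs, start, length, direction):
    """Pick the ngrams whose frequency rose (or fell) throughout the window."""
    return [w for w, s in freqs.items() if is_consistent(s, start, length, direction)]

def average_trajectory(freqs, words):
    """Year-by-year average frequency across the selected ngrams."""
    return [mean(vals) for vals in zip(*(freqs[w] for w in words))]

def inertia_holds(freqs, start, length, direction):
    """Does the average keep moving in `direction` after the window ends?
    Returns None when no ngram qualifies or no later years exist."""
    words = select_ngrams(freqs, start, length, direction)
    if not words:
        return None
    avg = average_trajectory(freqs, words)
    end = start + length - 1
    after = avg[end + 1:]
    if not after:
        return None
    return (mean(after) - avg[end]) * direction > 0

# Toy, made-up yearly frequencies (hypothetical data, not from the corpus).
toy = {
    "rising":  [1, 2, 3, 4, 5, 6, 7, 8],
    "falling": [8, 7, 6, 5, 4, 3, 2, 1],
    "noisy":   [3, 5, 2, 6, 1, 4, 2, 5],
}
print(select_ngrams(toy, 0, 4, +1))   # ngrams that rose during the window
print(inertia_holds(toy, 0, 4, +1))   # did the risers keep rising afterward?
print(inertia_holds(toy, 0, 4, -1))   # did the fallers keep falling afterward?
```

Selecting on the window and then judging only the years after it keeps the test honest: the ngrams are chosen for what they did during the window, so any continued movement afterward is not guaranteed by the selection itself.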
也许,只是也许,历史预测科学是可能的。或许,只是或许,我们的文化遵循着确定性规律。又或许,只是或许,我们所有的数据正将我们引向那条道路。
Maybe, just maybe, a predictive science of history is possible. Maybe, just maybe, our culture obeys deterministic laws. And maybe, just maybe, that is where all of our data is taking us.
但即使这样的理解是可能的,它真的是我们想要的吗?孔德认为如此。他认为,如果没有客观的衡量,没有可证伪的预测,我们对人类历史、社会和文化的理解将会极其贫乏。人类学家弗朗兹·博厄斯对此持不同意见:
But even if such an understanding is possible, is it really what we want? Comte thought so. He believed that without objective measurement, without falsifiable predictions, our understanding of human history, society, and culture would be deeply impoverished. The anthropologist Franz Boas disagreed:
物理学家比较一系列相似的事实,从中提炼出普遍存在的普遍现象。从此以后,单个事实对他来说变得不那么重要了,因为他只关注普遍规律。
The physicist compares a series of similar facts, from which he isolates the general phenomenon which is common to all of them. Henceforth the single facts become less important to him, as he lays stress on the general law alone.
另一方面,事实是历史学家重视和感兴趣的对象。......
On the other hand, the facts are the object which is of importance and interest to the historian. . . .
这两种方法哪一种更有价值?答案只能是主观的。……
Which of the two methods is of a higher value? An answer can only be subjective. . . .
简而言之:有时你想看图表,有时又想蜷缩着读一本好书。欢迎来到数字化的未来,见证历史。何不两者兼顾?
In short: Sometimes, you want to look at a chart. Other times, you want to curl up with a good book. Welcome to history in our digital future. Why not try both?
2010年12月16日下午2点,我们关于文化组学的文章出现在了网上,谷歌推出了Ngram Viewer,我们俩都松了一口气:终于完成了!
At 2:00 p.m. on December 16, 2010, our article on culturomics appeared online, Google launched the Ngram Viewer, and the two of us breathed a sigh of relief: we were finally done!
我们的休息一直持续到下午5点40分,这时马克斯·布罗克曼(现在是我们的经纪人)给我们发了一封邮件,标题很简单,就是“你的书”。马克斯,谢谢你发了这封邮件。另外,我们都是细节至上的完美主义者,拿到一朵百合花,立刻就想把它镀金、蘸上巧克力、炸成金黄色。谢谢你一次又一次地为我们这个疯狂的出书想法出力。
Our break lasted until exactly 5:40 p.m., when Max Brockman—now our agent—sent us an e-mail titled, simply, “your book.” Thank you, Max, for sending us that e-mail. Also: we are detail-obsessed perfectionists who, given a lily, immediately want to gild it, dip it in chocolate, and have it deep-fried. Thank you for going to bat, time and time again, for our crazy idea of a book.
如果没有我们编辑劳拉·佩尔恰塞佩(Laura Perciasepe)的非凡努力,这本书现在可能只是一个疯狂的想法。她竭尽全力地将这本书变成了现实。她源源不断地提供灵感、反馈和阅读任务。每次收到她寄来的新平装书,总是令人兴奋不已,就像一个非正式的“每月一书俱乐部”,从最根本的层面塑造了这本书。我们也非常感谢Riverhead出版社的设计师和文字编辑们,我们一直不厌其烦地联系着他们;也感谢我们的公关凯蒂·弗里曼(Katie Freeman)。
This book would still be no more than a crazy idea if not for the extraordinary efforts of our editor, Laura Perciasepe, who fought extremely hard to make it a reality. She was a constant source of ideas, feedback, and reading assignments. It was always exciting to get a random package in the mail with some new paperback that she wanted us to read, an informal book-of-the-month club that shaped the present volume on the most fundamental level. We also owe a deep debt of gratitude to the designers and copy editors at Riverhead, whom we pestered to no end; and to our publicist, Katie Freeman.
其他人也对本书产生了很大的影响。Julie Zauzmer 尤为突出。她无数次地阅读了文本,从文本的整体结构到各个逗号的位置,她对《Uncharted》各个方面的想法都对文本的形成起到了至关重要的作用。John Bohannon、Neva Cherniavsky Durand 和 Jan Zauzmer 也慷慨地多次审阅了文本;他们三位都贡献了深刻的见解和鼓励。我们还要感谢 Samuel Arbesman、Ivan Bochkov、Pedro Bordalo、Andrea Bress、Elisheva Carlebach、Olga Dudchenko、Yitzie Ehrenberg、Sue Lieberman、Oliver Medvedik、Arina Omer、Suhas Rao、Benjamin Schmidt 和 Elena Stamenova 对草稿的评论。
Other people had a strong influence on this book as well. Julie Zauzmer stands out. She read the text countless times, and her ideas about everything from the overall structure of the text to the position of individual commas helped shape Uncharted at every scale. John Bohannon, Neva Cherniavsky Durand, and Jan Zauzmer were also gracious enough to review the text numerous times; all three contributed deeply perceptive insights and encouragement. We are also grateful to Samuel Arbesman, Ivan Bochkov, Pedro Bordalo, Andrea Bress, Elisheva Carlebach, Olga Dudchenko, Yitzie Ehrenberg, Sue Lieberman, Oliver Medvedik, Arina Omer, Suhas Rao, Benjamin Schmidt, and Elena Stamenova for commenting on drafts.
科学是一种对话。本书中的想法是众多优秀合作者参与的对话的成果,无法一一列举。为了证明这一点,以下列出其中一部分名字:Aviva Aiden、Uri Alon、John Bohannon、Martin Camacho、Nicholas Christakis、Robert Darnton、Daniel Donoghue、Neva Cherniavsky Durand、Sara Eismann、George Fournier、Joseph Fruchter、Anthony Grafton、Jo Guldi、Joe Jackson、Eric Lander、Carol Lazell、Mark Liberman、Yuri Lin、Micheal Lopez、Sarah Johnson、Michael McCormick、Radhika Nagpal、Jeremy Rau、Charles Rosenberg、Tracey Robinson、Jonathan Saragosti、Benjamin Schmidt、Jesse Sheidlower、Yuan Shen、Stuart Shieber、Randy Stern、Tina Tang、Werner Treß、Adrian Veres、Ben Zimmer;美国传统词典的 Joe Pickett ;感谢大英百科全书的 Jorge Cauz、Carmen-Maria Hetrea、Dale Hoiberg 和 Kunal Sen ;感谢谷歌图书团队的全体成员,特别是 Ben Bayer、Dan Bloomberg、Will Brockman、Ben Bunnell、Dan Clancy、Matt Gray、Peter Norvig、Jon Orwant、Slav Petrov、Ashok Popat、Leonid Taycher、Leslie Yeh,尤其是 Alfred Spector。本文及注释中摘录了对话片段,重点介绍了许多关键贡献者;但我们收录的每则轶事都少了六则,这令我们深感遗憾。马丁·诺瓦克和史蒂芬·平克值得再次特别感谢:他们是我们工作的重要催化剂。
Science is a conversation. The ideas in this book are the fruits of a conversation that has involved too many wonderful collaborators to list. To prove it, here are but a few of their names: Aviva Aiden, Uri Alon, John Bohannon, Martin Camacho, Nicholas Christakis, Robert Darnton, Daniel Donoghue, Neva Cherniavsky Durand, Sara Eismann, George Fournier, Joseph Fruchter, Anthony Grafton, Jo Guldi, Joe Jackson, Eric Lander, Carol Lazell, Mark Liberman, Yuri Lin, Micheal Lopez, Sarah Johnson, Michael McCormick, Radhika Nagpal, Jeremy Rau, Charles Rosenberg, Tracey Robinson, Jonathan Saragosti, Benjamin Schmidt, Jesse Sheidlower, Yuan Shen, Stuart Shieber, Randy Stern, Tina Tang, Werner Treß, Adrian Veres, Ben Zimmer; Joe Pickett of the American Heritage Dictionary; Jorge Cauz, Carmen-Maria Hetrea, Dale Hoiberg, and Kunal Sen of the Encyclopædia Britannica; at Google, the entire Books team, notably Ben Bayer, Dan Bloomberg, Will Brockman, Ben Bunnell, Dan Clancy, Matt Gray, Peter Norvig, Jon Orwant, Slav Petrov, Ashok Popat, Leonid Taycher, Leslie Yeh, and especially Alfred Spector. Snippets of this conversation, highlighting many key contributors, appear throughout the text and notes; but for each anecdote we include, there are a half-dozen more we regret having had to leave out. Martin Nowak and Steven Pinker deserve to be singled out again: they have been essential catalysts of our work.
我们,以及我们的分析,不过是我们所读书籍的总和。我们感谢每一位将自己的名字和声誉投入到这门最古老的艺术中的人。
We, and our analyses, are little more than the sum of the books that we have read. We are grateful to everyone who has staked their name and reputation in this, most ancient art.
—Erez 和 JB
—Erez and JB
我感谢很多人。
I am grateful to many people.
海伦·苏丹尼克(Helen Sultanik)在六年级时教我科学。乔尔·沃洛韦尔斯基(Joel Wolowelsky)教会我数学的优雅。丹·埃谢尔(Dan Eshel)让我在他的实验室里玩耍。约翰·霍普金斯大学城市青年项目(CTY)让我认识了其他“呆子”。就这样,我逐渐爱上了科学。
Helen Sultanik taught me science in sixth grade. Joel Wolowelsky taught me the elegance of mathematics. Dan Eshel let me play in his laboratory. The Johns Hopkins CTY program introduced me to other dorks. In these ways, I grew to love science.
塞缪尔·科恩、罗伯特·冈宁、索尔·克里普克和保罗·西摩在我读本科时都给予我很大的鼓励。威尔·哈珀曾经说过一句话,他自己可能不记得了,但我永远不会忘记:这是我在如何选择研究课题方面得到的最好的建议。我的硕士论文导师埃利舍娃·卡勒巴赫让我领略了历史学家的人生。马丁·诺瓦克教授将我纳入羽翼之下,教我如何写科学论文,教我拥抱自己的幽默,并且相信我。埃里克·兰德鼓励我大胆创新。史蒂芬·平克在几乎没有人认真对待我们的想法的时代,认真对待我们的想法。
Samuel Cohen, Robert Gunning, Saul Kripke, and Paul Seymour encouraged me as an undergraduate. Will Happer once said something he will not remember but that I will not forget—the best advice I ever got about how to choose a problem. Elisheva Carlebach, my master’s thesis advisor, gave me a taste of the historian’s life. Professor Martin Nowak took me under his wing, taught me how to write a scientific paper, taught me to embrace my humor, and believed in me. Eric Lander challenged me to be bold. Steven Pinker took our ideas seriously at a time when almost no one else did.
劳伦斯·戴维 (Lawrence David) 和格伦·韦尔 (Glen Weyl) 被要求就文化组学进行大量的讨论,哈佛研究员学会的许多其他成员也同样如此。
Lawrence David and Glen Weyl got saddled with a lot of yakking about culturomics, as did many other members of the endlessly interesting Harvard Society of Fellows.
Michael Berger、John Bohannon、Avi Bossewitch、Neal Dach、Sarah Johnson、Ari Packman 和 Nicholas Christakis 确实给了我很大的帮助。
Michael Berger, John Bohannon, Avi Bossewitch, Neal Dach, Sarah Johnson, Ari Packman, and Nicholas Christakis have really been there for me.
我的实验室:Ivan Bochkov、Martin Camacho、Ashok Cutkosky、Olga Dudchenko、Neva Cherniavsky Durand、Zach Frankel、Maxim Massenkoff、Matt Nicklay、Arina Omer、Suhas Rao、Adrian Sanborn、Benjamin Schmidt、Elena Stamenova 和 Linfeng Yang;以及他们之前的 ROMEan,一直追溯到 Joe Jackson,他的本科论文激发了这一切。你们让科学变得有趣。
My lab: Ivan Bochkov, Martin Camacho, Ashok Cutkosky, Olga Dudchenko, Neva Cherniavsky Durand, Zach Frankel, Maxim Massenkoff, Matt Nicklay, Arina Omer, Suhas Rao, Adrian Sanborn, Benjamin Schmidt, Elena Stamenova, and Linfeng Yang; and the ROMEans who preceded them, all the way back to Joe Jackson, who spurred all of this with his undergraduate thesis. You guys make science fun.
当然,如果没有我尊敬的合著者,任何科学战友的名单都是不完整的。JB,与你共事近十年是一段不可思议的经历。一路上有如此多的发现,也充满了乐趣。
Of course, no listing of scientific comrades-in-arms would be complete without my esteemed coauthor. JB, working with you for nearly a decade has been an incredible experience. So many discoveries, and so much fun along the way.
我感恩拥有一个充满爱的家庭。我的姐妹们,塔玛、帕蒂和奥利;她们的丈夫,欧里、大卫和埃迪;还有孩子们,本、丹尼、伊利亚娜、艾什利、吉尔、艾萨克、诺亚、奥伦和佐伊。吉尔:感谢你多年来的校对。
I am grateful for a loving family. My sisters, Tamar, Pattie, and Orly; their husbands, Ouri, David, and Eddie; and children, Ben, Danny, Eliana, Eshli, Gil, Isaac, Noah, Oren, and Zoë. Gil: thanks for all the proofreading over the years.
我深深感谢我的母亲苏·利伯曼。首先,我存在。其次,她一直希望我过得最好,比如我五岁的时候,她就想让我去史岱文森高中读书。坚持住,伊玛。我知道最近过得很艰难。
I am deeply thankful to my mother, Sue Lieberman. First, I exist. Second, she has always wanted the best for me, like when I was five years old and she tried to enroll me in Stuyvesant High School. Hang in there, Ima. I know it has been a rough ride of late.
我感谢我的孩子们,加布里埃尔·伽利略、玛雅·阿玛拉和阿拉贡·巴纳纳,他们让我的生活充满欢笑和乐趣。我的妻子阿维娃在这八年里比我想象的还要好。在我写这本书的时候,她一直是我的坚强后盾。她才华横溢、善良、耐心、善解人意、支持我,是我科学上的伙伴,更重要的是,也是我生活中的伙伴。这真是一段奇妙的旅程。
I am thankful to my children, Gabriel Galileo, Maayan Amara, and Aragorn Banana, who fill my life with laughter and fun. My wife, Aviva, has been more wonderful than I could imagine for these eight years. She has been a rock while I wrote this book. She is brilliant, kind, patient, understanding, and supportive, a partner in science and, much more important, in life. It has been such a marvelous adventure.
我希望我能感谢我的父亲阿哈龙·利伯曼。他在我们写这本书的时候去世了,实际上,正是在我们写我最想和他讨论的那一章的时候。他是一位天才,也是一位伟大的发明家,他给了我很多礼物,但没有什么比他对我未来发展的不懈热情更伟大的了。
I wish I could thank my father, Aharon Lieberman. He passed away while we were writing this book, indeed while we were writing the chapter I most wanted to discuss with him. He was a genius, and a great inventor, and he gave me so many gifts, but none greater than his relentless enthusiasm for who I might become.
这本书很大程度上归功于他。用我父亲自己的话来说,他是一个“职业难民”,他一直在英语中摸索,却从未真正把它当成自己的家。我九岁的时候,他让我坐在书桌前,为他的公司写点东西。我写了,写得非常非常糟糕。他自己不可能写得更好,但他知道,以英语为母语的我可以。于是,他详尽而毫不留情地剖析了我的失败,并让我从头再写一遍。一遍又一遍。我们就这样工作了八年,一个项目接一个项目,直到我离开家去上大学。就这样,这位最意想不到的老师教会了我写作。
This book owes him a great debt. My father was, in his words, a “professional refugee,” and he stomped around the English language without ever quite making it his home. When I was nine years old, he sat me down at a desk and made me write something for his company. I wrote it, very very badly. He couldn’t have done it any better, but he knew that I, with the benefit of a native tongue, could. So he dissected my failure, in lengthy and unfaltering detail, and told me to do it again, from scratch. And again. And again. We worked in this way for eight years, on one project after another, until I left to go to college. And in this manner, the unlikeliest of teachers taught me how to write.
对我父亲来说,平庸是一种道德败坏。我想念他,感谢他。祈祷他会喜欢这本书。
To my father, mediocrity was a moral failure. I miss him. I thank him. I pray he enjoys this book.
—埃雷兹
—Erez
谢谢你,伊娜:你是我认识的最了不起的人,是我的灵感、力量和快乐的源泉。
Thank you, Ina: you are the most amazing person I know, my source of inspiration, strength, and happiness.
我之所以能成为今天这样的人,很大程度上要归功于我的亲人:我的父亲吉尔斯(Gilles)、我的母亲克里斯汀(Christine)、我的姐姐弗洛伦斯(Florence)和我的弟弟马克-安托万(Marc-Antoine)。我每天都想念他们。我的大家庭在我的生活中扮演着举足轻重的角色:住在吉隆坡的马蒂厄(Mathieu)和托马斯(Thomas);住在普罗旺斯的祖父母马内(Mané)和爸爸;以及我所有来自毛里求斯的家人。我必须特别感谢祖母、多米尼克、法布里斯、阿诺、塞德里克、塞西尔、瓦莱丽、奥雷莉和瓦内萨,以及英国和保加利亚的波波维和伊万诺维。他们给了我真正无价的东西:情感上的安全感、稳定感和幸福感,让我能够自由地去冒险。
Most of who I am I owe to my close family: my father, Gilles, and my mother, Christine; my sister, Florence, and my brother, Marc-Antoine. I miss them, every day. My extended family plays no small role in my life: Mathieu and Thomas in Kuala Lumpur; my grandparents Mané and Daddy in Provence; all my family from Mauritius. I must have a special word for Grand-Mère, Dominique, Fabrice, Arnaud, Cédric, Cécile, Valérie, Aurélie, and Vanessa; and the Popovi and the Ivanovi in the UK and Bulgaria. They provide me with something truly invaluable: the emotional safety, the stability, and the happiness that make me free to take bolder risks.
如果我没有与罗伊·基肖尼和马丁·诺瓦克相遇,今天的我将会截然不同。他们让我接触学术研究,改变了我的人生轨迹。我由衷感谢他们给予我空间、支持、指导和自由,让我能够探索科学的边界。我还要再次感谢史蒂芬·平克,他是一位优秀的导师:他的随时待命、体贴周到以及各种形式的支持,给予我的帮助远超他的想象。
I would be a profoundly different person today had I not crossed paths with Roy Kishony and Martin Nowak. They changed the course of my life when they exposed me to academic research. I can’t thank them enough for giving me the space, support, guidance, and freedom to go to the boundaries of science and explore. I want to thank Steven Pinker again for being a wonderful mentor: his availability, thoughtfulness, and multiform support have helped me more than he knows.
我感谢蒂姆·米奇森(Tim Mitchison)始终不渝的科学和精神支持,感谢他让系统生物学成为一个汇聚不同寻常兴趣的平台,也感谢他如此具有前瞻性的思维。我非常感谢克里斯·桑德(Chris Sander)和黛比·马克斯(Debbie Marks)以各种无私的方式支持我,从提供写作和研究的空间,到鼓励我进行充满创意的对话。我衷心感谢迈克·麦考密克(Mike McCormick)、鲍勃·达恩顿(Bob Darnton)、弗朗索瓦·塔代伊(François Taddei)、托尼·格拉夫顿(Tony Grafton)、弗朗哥·莫雷蒂(Franco Moretti)和马修·乔克斯(Matthew Jockers),感谢他们就科学和人文学科展开了许多引人入胜的对话。
I am grateful to Tim Mitchison for his unfailing scientific and moral support, for making Systems Biology a home for somewhat unconventional interests, and simply for being so forward-thinking. I am extremely grateful to Chris Sander and Debbie Marks for supporting me in many selfless ways, from providing space where I could write and do research, to entertaining endlessly creative conversations. I am deeply thankful to Mike McCormick, Bob Darnton, François Taddei, Tony Grafton, Franco Moretti, and Matthew Jockers for many enthralling conversations about science and the humanities.
值得再次强调的是,我非常感谢那些为文化组学做出直接贡献的人们,其中包括 Aviva Aiden、John Bohannon、Martin Camacho、Neva Cherniavsky、Yuri Lin、Peter Norvig、Jon Orwant、Slav Petrov、Benjamin Schmidt 和 Adrian Veres。这真是一段奇妙的旅程。
It is worth repeating how grateful I am to the people who have directly contributed to culturomics, among them Aviva Aiden, John Bohannon, Martin Camacho, Neva Cherniavsky, Yuri Lin, Peter Norvig, Jon Orwant, Slav Petrov, Benjamin Schmidt, and Adrian Veres. This has been quite an adventure.
我感谢汤姆·里利 (Tom Rielly) 和洛根·麦克卢尔 (Logan McClure),他们聚集了我所见过的最富有创造力、最迷人、最振奋人心、却又不失谦逊的一群人——TED 研究员。这些研究员是我以及数百万人的持续灵感源泉。在此,我还要感谢 MEX 的 DAMM 成员们,感谢他们的创造力和整体的冷静。
I thank Tom Rielly and Logan McClure for having gathered the most creative, fascinating, uplifting, yet unpretentious group of people I have ever met, the TED Fellows. These fellows are a continuous source of inspiration to me and to millions of others. On that note, I am grateful to members of DAMM at MEX for their creative energy and general cool.
文化组学的诞生很大程度上得益于我多年来与朋友和同事们的无数次交流。我的大多数想法最初都是在佩德罗·博尔达洛身上验证的,当时它们还只是错误想法的模糊影子。谢谢你,我的朋友,让我肆无忌惮地利用你无限的求知欲(不过,请记录在案:我在桌上足球上完胜你;既然这都写在书里了,那肯定是真的)。
Culturomics owes a great deal to the countless conversations I’ve had with my friends and colleagues over the years. Most of my ideas were first tested on Pedro Bordalo when they were still barely shadows of a mistaken thought. Thank you, my friend, for letting me shamelessly take advantage of your unbounded intellectual curiosity (still, let the record show that I own you at foosball—since it is written in a book, it must be true).
以下人员让我向他们抛出想法(通常都经过他们的同意),从而帮助发展了文化组学:Pamela Yeh、Tami Lieberman 和 Remy Chait(感谢你们帮助我成为更好的合作者);Kalin Vetsigian、Adam Palmer、Tobias Bollenbach 和 Erdal Toprak(感谢你们让我学到的一点生物学知识);Michael Manapat、Daniel Rosenbloom、Alison Hill、Tibor Antal、Anna Dreber、Thomas Pfeiffer 和 Corina Tarnita(感谢你们提供的数学知识);以及 Fabien Azoulay、Marc Azoulay、Côme Denoyel、Neal Desai、Samuel Fraiberger、Bastien Guerin、Thomas Leonard、Nathan Leverence、Sidney Ouarzazi、Thibault Peyronel、Nick Stroustrup 和 Mohamed Toumi。
The following people helped develop culturomics by letting me bounce ideas at them, usually with their consent: Pamela Yeh, Tami Lieberman, and Remy Chait (thank you for helping me be a better collaborator); Kalin Vetsigian, Adam Palmer, Tobias Bollenbach, and Erdal Toprak (thank you for the little biology I know); Michael Manapat, Daniel Rosenbloom, Alison Hill, Tibor Antal, Anna Dreber, Thomas Pfeiffer, and Corina Tarnita (thank you for the math); and Fabien Azoulay, Marc Azoulay, Côme Denoyel, Neal Desai, Samuel Fraiberger, Bastien Guerin, Thomas Leonard, Nathan Leverence, Sidney Ouarzazi, Thibault Peyronel, Nick Stroustrup, and Mohamed Toumi.
当然,还要谢谢你,埃雷兹。过去十年和你一起探索科学前沿是一段奇妙的经历;在很多方面,它塑造了今天的我。
And of course: thank you, Erez. This past decade exploring the frontiers of science with you was a fabulous experience; in so many ways, it shaped the person I am today.
—JB
—JB
关于图表
About the Charts
本书中的图表灵感源自兰德尔·门罗 (Randall Munroe) 的 xkcd 网络漫画 http://xkcd.com/ 的赏心悦目的视觉风格。自动化 xkcd 风格图表生成的想法由 Damon McDougall 提出;本书中的实际图表则使用 Python 编写,使用了 Jake VanderPlas 修改过的代码。这些 ngram 可以在原版 Google Ngram Viewer(位于 http://books.google.com/ngrams/)上以交互形式生成,也可以在 http://xkcd.culturomics.org 上以 xkcd 风格生成。希望门罗不介意。请参阅 http://xkcd.com/1007/ 和 http://xkcd.com/1140/。他最喜欢的一些 ngram 可以在 http://xkcd.com/ngram-charts/ 上查看。
The charts in this book were inspired by the delightful visual style of Randall Munroe’s xkcd Web comic, http://xkcd.com/. The idea of automating xkcd-style graph production was proposed by Damon McDougall; the actual graphs in this book were created in Python, using a modified version of code by Jake VanderPlas. These ngrams can be generated, in interactive form, at the original Google Ngram Viewer, located at http://books.google.com/ngrams/, and, in xkcd style, at http://xkcd.culturomics.org. We hope Munroe won’t mind. See http://xkcd.com/1007/ and http://xkcd.com/1140/. Some of his favorite ngrams appear at http://xkcd.com/ngram-charts/.
请注意,ngram 数据区分大小写,且 ngram 图取决于多个参数。除非本注释另有说明,文中显示的所有 ngram 图表均与 Google Ngram Viewer 的结果完全一致,该视图使用 2012 年英语语料库并经过三年平滑处理。除非另有说明,查询文本全部小写,专有名词除外,专有名词以通常的方式大写。所有底层数据集也可以在http://books.google.com/ngrams/datasets下载。
Note that ngram data is case-sensitive, and ngram plots depend on several parameters. Unless otherwise indicated in these notes, all ngram charts shown in the text correspond exactly to the results of the Google Ngram Viewer, using the English 2012 corpus and three-year smoothing. Unless otherwise noted, the query text is entirely in lowercase, except for proper nouns, which are capitalized in the usual way. All of the underlying datasets can also be downloaded at http://books.google.com/ngrams/datasets.
当引用特定的 ngram(例如德语语料库中的Marc Chagall和Kubismus)时,我们会以 NV 的形式引用:“Marc Chagall, Kubismus”/德语。如果没有列出语料库,则引用 2012 年英语语料库,例如 NV:“cubism”。我们有时也会标注年份范围或平滑值。
When referring to particular ngrams, such as Marc Chagall and Kubismus in the German corpus, we will cite them as NV: “Marc Chagall, Kubismus”/German. If no corpus is listed, the reference is to the English 2012 corpus, e.g., NV: “cubism.” We sometimes note a year range or a smoothing value as well.
在出版物中使用 ngram 数据时,请引用 Jean-Baptiste Michel、Yuan Kui Shen、Aviva Presser Aiden、Adrian Veres、Matthew K. Gray、Google Books Team、Joseph P. Pickett、Dale Hoiberg、Dan Clancy、Peter Norvig、Jon Orwant、Steven Pinker、Martin A. Nowak 和 Erez Lieberman Aiden 的“利用数百万本数字化图书对文化进行定量分析”,Science 331,第 6014 期(2011 年 1 月 14 日;2010 年 12 月 16 日提前在线发表):176–82。
When using ngram data in a publication, please cite Jean-Baptiste Michel, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, Dale Hoiberg, Dan Clancy, Peter Norvig, Jon Orwant, Steven Pinker, Martin A. Nowak, and Erez Lieberman Aiden, “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331, no. 6014 (January 14, 2011; published online ahead of print December 16, 2010): 176–82.
第一章 爱丽丝镜中奇遇记
CHAPTER 1. THROUGH THE LOOKING GLASS
简介
Intro
“the United States in their treaties with His Britannic Majesty.” Emphasis added.
宪法。宪法本身将“合众国”视为复数。例如,“叛国罪,仅指对合众国发动战争。”参见美国宪法第三条第三款。
The Constitution. The Constitution itself treats the United States as a plural. For instance, “Treason against the United States, shall consist only in levying War against them.” See U.S. Const., art. III, §3.
is/are 的转换发生的时间。复数问题在1901年显然仍是一个热门话题。当时,曾在本杰明·哈里森总统手下担任国务卿的约翰·W·福斯特在《纽约时报》上发表了一篇文章,探讨单复数形式的优劣。参见约翰·W·福斯特的《Are 还是 Is?动词是复数还是单数与 United States 一词的关系》,《纽约时报》,在线版,网址:http://goo.gl/Ql60b。
When the is/are switch took place. The plural question clearly remained a live issue in 1901, when John W. Foster, who had served as secretary of state under President Benjamin Harrison, wrote an article debating the merits of the singular and plural forms in the New York Times. See John W. Foster, “Are or Is? Whether a Plural or a Singular Verb Goes with the Words United States,” New York Times, online at http://goo.gl/Ql60b.
“……战争的某些重大后果显而易见。”这句话出自詹姆斯·M·麦克弗森的《自由的呐喊》(牛津:牛津大学出版社,1988年),第859页。我们希望麦克弗森教授不要太介意我们纠正他那部当之无愧的著作《自由的呐喊》中的一个错误。我们强调这一点,并非为了批评他的历史敏锐性,而是因为麦克弗森作为历史学家,堪称佼佼者。要展现这些机械方法的实用性,最好的方法就是展示即使是最伟大的历史学家也能运用它们。
“. . . Certain large consequences of the war seem clear.” The quote is from James M. McPherson, Battle Cry of Freedom (Oxford: Oxford University Press, 1988), 859. We hope Professor McPherson does not mind, too much, our correcting an error in his deservedly celebrated work Battle Cry of Freedom. We highlighted it not as a criticism of his historical acumen, but precisely because McPherson, as historians go, is the best of the best. There is no better way to demonstrate the utility of these mechanical methods than by showing how even the greatest historians can use them.
“几年前有段时间。”这句话出自 1887 年 4 月 24 日的《华盛顿邮报》,引自 Ben Zimmer 的《生活在这些,呃,这个美国》,《语言日志》 ,2005 年 11 月 24 日,http://goo.gl/Ug8iX。
“There was a time a few years ago.” The quote is from the Washington Post, April 24, 1887, as quoted in Ben Zimmer, “Life in These, uh, This United States,” Language Log, November 24, 2005, http://goo.gl/Ug8iX.
图表。NV:“The United States is, The United States are”。请注意,如果首字母不大写,就会捕获到不需要的表述,例如 The Senate of the United States is,其中的 is 指的不是 the United States,而是 The Senate of the United States。
Chart. NV: “The United States is, The United States are.” Note that without the initial capital letter, one captures unwanted formulations, such as The Senate of the United States is, in which the is does not refer to the United States but to The Senate of the United States.
光的形状
The Shape of the Light
镜片。文森特·伊拉迪(Vincent Ilardi)在《从眼镜到望远镜的文艺复兴视野》(费城:美国哲学学会,2007年)一书中详细介绍了这些发展的历史。
Lenses. A richly detailed history of these developments appears in Vincent Ilardi, Renaissance Vision from Spectacles to Telescopes (Philadelphia: American Philosophical Society, 2007).
罗伯特·胡克。埃雷兹在撰写本书期间,访问了瑞典乌普萨拉大学,有机会研读了1665年第一版胡克的《显微图谱:或称通过放大镜观察和探究而制作的微小物体的生理描述》。即使以现代的标准来看,胡克手绘的显微镜所见插图也令人叹为观止。很难想象在当时,这些插图的视觉震撼力有多么惊人。《显微图谱》是第一本科学畅销书,是科普文学的鼻祖。尽管如此,最初印刷的副本仍然非常罕见。如今,数字图书革命已开启:任何人都可以在线阅读原版。请参阅罗伯特·胡克的《显微图谱》(伦敦:Jo. Martyn and Ja. Allestry,1665年),在线网址为http://goo.gl/KSnaH。
Robert Hooke. While writing this book, Erez visited Uppsala University in Sweden, where he had the opportunity to examine a 1665 first edition of Hooke’s Micrographia: or some physiological descriptions of minute bodies made by magnifying glasses with observations and inquiries thereupon. Even by modern standards, Hooke’s hand-drawn illustrations of what he saw through the microscope are spectacular. It is hard to imagine how visually stunning they would have been at the time. Micrographia was the first scientific bestseller, the ur-text of the popular science genre. Still, copies from the initial print run are very rare. Enter the digital book revolution: Today, anyone can peruse the original online. See Robert Hooke, Micrographia (London: Jo. Martyn and Ja. Allestry, 1665), online at http://goo.gl/KSnaH.
微生物。其发现者安东尼·范·列文虎克最初将其称为“微动物”(animalcules)。参见克利福德·多贝尔著《安东尼·范·列文虎克和他的“小动物”》(纽约:Harcourt, Brace,1932年)。你体内的细菌细胞数量是人类细胞的十倍。参见D. C. 萨维奇著《胃肠道微生物生态学》,载于《微生物学年鉴》第31卷(1977年):107页,网址:http://goo.gl/hzVlrR。我们体内的细菌数量超过人类总人口,倍数约为10^14,即一百万亿。
Microbes. First called animalcules by their discoverer, Antonie van Leeuwenhoek. See Clifford Dobell, Antony van Leeuwenhoek and His “Little Animals” (New York: Harcourt, Brace, 1932). Your own body contains ten times as many bacterial cells as human cells. See D. C. Savage, “Microbial Ecology of the Gastrointestinal Tract,” Annual Review of Microbiology 31 (1977): 107, online at http://goo.gl/hzVlrR. The bacteria that live inside us outnumber the human population by a factor of about 10^14, or one hundred trillion.
伽利略望远镜的放大倍数。伽利略最早的几架望远镜并没有这么好;经过几轮改进才实现了30倍的放大倍数。参见理查德·S·韦斯特福尔,《科学与赞助:伽利略与望远镜》,《Isis》 76卷,第1期(1985年3月):11-30页,在线访问:http://goo.gl/eiPt3U;亨利·C·金,《望远镜的历史》(伦敦:C. Griffin出版社,1955年)。
Magnification achieved by Galileo’s telescope. Galileo’s very first telescopes were not as good; 30X was achieved only after several rounds of improvements. See Richard S. Westfall, “Science and Patronage: Galileo and the Telescope,” Isis 76, no. 1 (March 1985): 11–30, online at http://goo.gl/eiPt3U; Henry C. King, The History of the Telescope (London: C. Griffin, 1955).
伽利略与现代性的关系。参见大卫·怀特豪斯著《文艺复兴天才:伽利略·伽利莱及其对现代科学的遗产》(纽约:斯特林出版社,2009年);大卫·伍顿著《伽利略:天空的守望者》(康涅狄格州纽黑文:耶鲁大学出版社,2010年);马克·布雷克著《科学革命:伽利略和达尔文如何改变我们的世界》(纽约:帕尔格雷夫·麦克米伦出版社,2009年);让·迪茨·莫斯著《天体新奇:哥白尼论战中的修辞与科学》(芝加哥:芝加哥大学出版社,1993年);罗伯特·S·韦斯特曼著《哥白尼问题:预言、怀疑论与天体秩序》(伯克利:加州大学出版社,2011年)。
Galileo’s relationship to modernity. See David Whitehouse, Renaissance Genius: Galileo Galilei and His Legacy to Modern Science (New York: Sterling, 2009); David Wootton, Galileo: Watcher of the Skies (New Haven, CT: Yale University Press, 2010); Mark Brake, Revolution in Science: How Galileo and Darwin Changed Our World (New York: Palgrave Macmillan, 2009); Jean Dietz Moss, Novelties in the Heavens: Rhetoric and Science in the Copernican Controversy (Chicago: University of Chicago Press, 1993); Robert S. Westman, The Copernican Question: Prognostication, Skepticism, and Celestial Order (Berkeley: University of California Press, 2011).
数羊
Counting Sheep
文字的诞生。人类早期文字史的揭开,很大程度上得益于丹尼斯·施曼特-贝塞拉特的开创性工作。被施曼特-贝塞拉特称为“符号体系的罗塞塔石碑”的发现(古代文字考古学中最重要的发现之一)是一块在伊拉克努齐出土的空心泥板,年代为公元前第二个千年。泥板外侧的楔形文字铭文写道:“21只产羔的母羊//6只母羊羔//8只成年公羊//4只公羊羔//6只产羔的母山羊//1只公山羊//3只母山羊羔//牧羊人齐卡鲁的印章。”打开泥板后,发现里面有49个计数物:外侧列出的每一只动物对应一个。为什么会重复呢?因为外侧的铭文很容易查阅,但也很容易被篡改。内部虽然难以查阅,但很难被篡改。因此,合同双方如果发生纠纷,可以通过打碎泥板、露出里面的计数物来裁决。学者们认为,一段时间后,人们意识到楔形文字既可以用于泥板外部,也可以用于泥板内部,从而无需计数物,也使得仅使用文字书写的法律文件成为可能。将一部分文字“公开”以便查阅,将另一部分文字封存以便裁决纠纷,这种订立合同的做法变得很普遍;这种合同的例子出现在希伯来圣经耶利米书 32:10-11 中。参见 Barry B. Powell 著《写作:文明技术理论与历史》(英国奇切斯特:Wiley-Blackwell,2009 年);Richard Rudgley 著《石器时代失落的文明》(纽约:自由出版社,1999 年);Denise Schmandt-Besserat 著《写作的起源》(奥斯汀:德克萨斯大学出版社,1996 年);Denise Schmandt-Besserat 著《写作之前》第一卷,《从计数到楔形文字》(奥斯汀:德克萨斯大学出版社,1992 年);Denise Schmandt-Besserat 著《写作之前》第二卷,《近东标记目录》(奥斯汀:德克萨斯大学出版社,1992年)。当然,研究人员之间很少有一致的意见。一些人认为文字在埃及独立出现,可能通过一种截然不同的机制出现。参见Larkin Mitchell,《最早的埃及象形文字》,《考古学》 52卷,第2期(1999年3/4月),在线访问:http://goo.gl/tM3GEQ。
The birth of writing. The early history of human writing was uncovered in large part through the pioneering work of Denise Schmandt-Besserat. What Schmandt-Besserat has called “the Rosetta stone of the token system,” one of the most important finds in the archaeology of ancient writing, is a hollow tablet discovered at Nuzi in Iraq from the second millennium BCE. The cuneiform inscription on the outside of the tablet reads: “21 ewes that lambed//6 female lambs//8 full-grown male sheep//4 male lambs//6 she-goats that kid//1 he-goat//3 female kids//The Seal of Ziqarru, the shepherd.” When the tablet was opened up, forty-nine counters were found inside: one for each animal listed on the outside. Why the redundancy? The inscription on the outside could be easily referred to, but also could be easily tampered with. The inside, though difficult to refer to, would be hard to tamper with. Thus, a dispute between the parties to the contract could be adjudicated by breaking open the tablet to reveal the counters inside. Scholars believe that, after some time, people realized that cuneiform could be used on the inside as well as the outside, eliminating the need for counters and making it possible to create legal documents that used writing alone. The practice of creating contracts in which part of the writing was left “open” for easy reference, and part was sealed for the purpose of adjudicating disputes, became common; an example of this sort of contract appears in the Hebrew Bible, Jeremiah 32:10–11. See Barry B. Powell, Writing: Theory and History of the Technology of Civilization (Chichester, England: Wiley-Blackwell, 2009); Richard Rudgley, The Lost Civilizations of the Stone Age (New York: Free Press, 1999); Denise Schmandt-Besserat, How Writing Came About (Austin: University of Texas Press, 1996); Denise Schmandt-Besserat, Before Writing, vol. 1, From Counting to Cuneiform (Austin: University of Texas Press, 1992); Denise Schmandt-Besserat, Before Writing, vol. 2, A Catalog of Near Eastern Tokens (Austin: University of Texas Press, 1992). Of course, unanimity is rare among researchers. Some argue that writing emerged independently in Egypt, possibly via a quite different mechanism. See Larkin Mitchell, “Earliest Egyptian Glyphs,” Archaeology 52, no. 2 (March/April 1999), online at http://goo.gl/tM3GEQ.
大数据
Big Data
位和字节。经典游戏“二十个问题”也可以称为“两个半字节”,因为这就是你在猜测之前可以收集的信息量。
Bits and bytes. The classic game twenty questions could also be called “two and a half bytes,” because that’s how much information you’re allowed to collect before you guess.
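The arithmetic behind this note can be checked with a short sketch. It assumes that each yes/no answer carries at most one bit and that a byte is eight bits; the function names are illustrative only and do not come from the book.

```python
def questions_to_bytes(num_questions: int) -> float:
    """Convert a number of yes/no questions (one bit each) into bytes."""
    bits_per_byte = 8
    return num_questions / bits_per_byte


def distinguishable_items(num_questions: int) -> int:
    """Upper bound on how many items that many yes/no answers can tell apart."""
    return 2 ** num_questions


if __name__ == "__main__":
    print(questions_to_bytes(20))      # bytes of information in twenty questions
    print(distinguishable_items(20))   # number of distinct answer sequences
```

Twenty questions therefore amount to 2.5 bytes, enough in principle to single out one item among 2^20 = 1,048,576 possibilities.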
五泽字节。预测基于IDC的“数字宇宙”报告。参见John Gantz和David Reinsel合著的《2020年的数字宇宙》,EMC公司,2012年12月,http://idcdocserv.com/1414。另请参阅《经济学人》 2010年2月25日刊载的《数据,无处不在的数据》,http://goo.gl/VsXh5P;Roger E. Bohn和James E. Short合著的《信息量有多大?2009》,全球信息产业中心,2010年1月,http://goo.gl/pt0R;Peter Lyman和Hal R. Varian合著的《信息量有多大?2003》,加州大学伯克利分校,http://goo.gl/vpo9N。
Five zettabytes. Projection based on the IDC “Digital Universe” report. See John Gantz and David Reinsel, “The Digital Universe in 2020,” EMC Corporation, December 2012, http://idcdocserv.com/1414. See also “Data, Data Everywhere,” Economist, February 25, 2010, online at http://goo.gl/VsXh5P; Roger E. Bohn and James E. Short, “How Much Information? 2009,” Global Information Industry Center, January 2010, http://goo.gl/pt0R; Peter Lyman and Hal R. Varian, “How Much Information? 2003,” University of California at Berkeley, http://goo.gl/vpo9N.
写出信息。我们假设典型的比特需要6毫米才能写入。这在一定程度上取决于1和0的比例,因为“1”非常窄。Vikram Kamath等人在《自动手写分析系统的开发》( ARPN Journal of Engineering and Applied Sciences,第6卷,第9期,2011年9月)一文中列出了手写文本的典型字母大小,该文在线发表于http://goo.gl/4mlkTm。
Writing out information. We assume the typical bit takes six millimeters to write. This depends to some extent on the ratio of ones to zeros, since “1” is very narrow. Typical letter sizes for handwritten text are noted in Vikram Kamath et al., “Development of an Automated Handwriting Analysis System,” ARPN Journal of Engineering and Applied Sciences 6, no. 9 (September 2011), online at http://goo.gl/4mlkTm.
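As a rough illustration of this estimate, the sketch below multiplies the six-millimeter-per-bit assumption stated in the note by the number of bits in five zettabytes. It assumes decimal (SI) zettabytes of 10^21 bytes and a light-year of about 9.4607 × 10^15 meters; the function and constant names are hypothetical.

```python
MM_PER_BIT = 6                      # the note's assumption: 6 mm per handwritten bit
BITS_PER_BYTE = 8
ZETTABYTE = 10 ** 21                # bytes, decimal (SI) definition (assumed here)
METERS_PER_LIGHT_YEAR = 9.4607e15   # approximate length of one light-year


def handwritten_length_m(num_bytes: int) -> float:
    """Length in meters of a handwritten line of 1s and 0s encoding num_bytes."""
    total_mm = num_bytes * BITS_PER_BYTE * MM_PER_BIT
    return total_mm / 1000


def meters_to_light_years(meters: float) -> float:
    """Convert a length in meters to light-years."""
    return meters / METERS_PER_LIGHT_YEAR


if __name__ == "__main__":
    length_m = handwritten_length_m(5 * ZETTABYTE)
    print(f"{length_m:.2e} m")
    print(f"{meters_to_light_years(length_m):,.0f} light-years")
```

Under these assumptions, five zettabytes written out by hand would stretch about 2.4 × 10^20 meters, or roughly 25,000 light-years.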
数羊。因此,除非宇宙膨胀得太厉害,数羊是一个完全可以解决的问题。
Sheep counting. Thus, counting sheep is a completely solved problem, unless the universe expands very considerably.
翻倍率。根据IDC的估计,人类的数据足迹将从2005年的130EB增长到2020年的40000EB(40ZB)。这意味着翻倍率约为一年零十个月。参见上文。
Doubling rate. According to IDC estimates, humanity’s data footprint will grow from 130 exabytes in 2005 up to 40,000 exabytes (40 zettabytes) in 2020. This suggests a doubling rate of about one year and ten months. See above.
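The doubling time quoted here follows from the two IDC figures if growth is assumed to be steadily exponential between 2005 and 2020. A minimal sketch of that calculation, with a hypothetical function name:

```python
import math


def doubling_time_years(start: float, end: float, years: float) -> float:
    """Doubling time implied by exponential growth from start to end over years."""
    doublings = math.log2(end / start)
    return years / doublings


if __name__ == "__main__":
    t = doubling_time_years(130, 40_000, 2020 - 2005)   # exabytes, 2005 to 2020
    whole_years = int(t)
    months = (t - whole_years) * 12
    print(f"{t:.2f} years, i.e. about {whole_years} year and {months:.0f} months")
```

Growing from 130 to 40,000 exabytes takes about 8.3 doublings, so fifteen years works out to roughly 1.8 years per doubling, which matches the note's figure of about one year and ten months.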
数码镜头
The Digital Lens
Facebook的规模。请参阅美联社2012年10月4日报道“Facebook用户突破10亿”,网址:http://goo.gl/nfK32P。
Size of Facebook. See “Facebook Tops 1 Billion Users,” Associated Press, October 4, 2012, online at http://goo.gl/nfK32P.
乔恩·莱文(Jon Levin)。参见利兰·艾纳夫(Liran Einav)等著,“从在线市场卖家实验中学习”,美国国家经济研究局(2011年9月),网址:http://goo.gl/f9ghir。
Jon Levin. See Liran Einav et al., “Learning from Seller Experiments in Online Markets,” National Bureau of Economic Research (September 2011), online at http://goo.gl/f9ghir.
詹姆斯·福勒。参见罗伯特·M·邦德等,《一项6100万人参与的社会影响与政治动员实验》,《自然》杂志489卷,第7415期(2012年):第295-98页,在线访问:http://goo.gl/AQdAS0。
James Fowler. See Robert M. Bond et al., “A 61-Million-Person Experiment in Social Influence and Political Mobilization,” Nature 489, no. 7415 (2012): 295–98, online at http://goo.gl/AQdAS0.
阿尔伯特-拉斯洛·巴拉巴西。参见 Chaoming Song 等人,《人类移动性可预测性的极限》,《科学》第 327 卷,第 5968 期(2010 年):1018–21,在线网址:http://goo.gl/rYlF2v。
Albert-László Barabási. See Chaoming Song et al., “Limits of Predictability in Human Mobility,” Science 327, no. 5968 (2010): 1018–21, online at http://goo.gl/rYlF2v.
杰里米·金斯伯格。请参阅 Jeremy Ginsberg 等,“利用搜索引擎查询数据检测流感疫情”,《自然》 457 (2009): 1012–1014,在线网址:http://goo.gl/WHEWW。
Jeremy Ginsberg. See Jeremy Ginsberg et al., “Detecting Influenza Epidemics Using Search Engine Query Data,” Nature 457 (2009): 1012–14, online at http://goo.gl/WHEWW.
拉吉·切蒂 (Raj Chetty)。参见拉吉·切蒂、约翰·N·弗里德曼和乔纳·E·罗科夫合著的《教师的长期影响》,美国国家经济研究局(2011 年 12 月),在线网址:http://goo.gl/C18JQ;拉吉·切蒂等人合著的《你的幼儿园课堂如何影响你的收入?》,美国国家经济研究局(2011 年 3 月),在线网址:http://goo.gl/N9O6a。
Raj Chetty. See Raj Chetty, John N. Friedman, and Jonah E. Rockoff, “The Long-Term Impacts of Teachers,” National Bureau of Economic Research (December 2011), online at http://goo.gl/C18JQ; Raj Chetty et al., “How Does Your Kindergarten Classroom Affect Your Earnings?,” National Bureau of Economic Research (March 2011), online at http://goo.gl/N9O6a.
内特·西尔弗(Nate Silver)。参见内特·西尔弗的《FiveThirtyEight》,http://www.fivethirtyeight.com;内特·西尔弗的《信号与噪声》 (纽约:企鹅出版社,2012 年)。
Nate Silver. See Nate Silver, FiveThirtyEight, http://www.fivethirtyeight.com; Nate Silver, The Signal and the Noise (New York: Penguin, 2012).
万物图书馆
The Library of Everything
每本书。这究竟意味着什么?将每本书的每本都数字化意义不大——尽管我们也不会说完全没有意义;人们的旁注可能很有意思。参见安东尼·格拉夫顿和乔安娜·温伯格合著的《我一直热爱圣言》(马萨诸塞州剑桥:哈佛大学出版社,2011年)。另一方面,几个世纪以来,最著名的作品可能会出现许多版本,而且彼此之间可能存在很大差异。这可能会变得相当棘手。例如,参见埃里克·拉姆齐的《谷歌图书搜索:多个版本给出古怪的结果》,《看图》,2010年10月12日,http://goo.gl/6YNld。就谷歌图书而言,其目标是将每个版本的书籍都数字化一份。
Every book. What does this actually mean? There’s not much point in digitizing every copy of every book ever written—although we wouldn’t say that there’s no point at all; people’s marginal notes can be fascinating. See Anthony Grafton and Joanna Weinberg, I Have Always Loved the Holy Tongue (Cambridge, MA: Harvard University Press, 2011). On the other hand, numerous editions of the most famous works can appear over the centuries and can differ substantially from one another. This can get pretty hairy. See, for instance, Eric Rumsey, “Google Book Search: Multiple Editions Give Quirky Results,” Seeing the Picture, October 12, 2010, http://goo.gl/6YNld. In the case of Google Books, the goal is to digitize one copy of every book edition.
斯坦福数字图书馆技术项目。参见斯坦福大学“斯坦福数字图书馆技术项目”,http://goo.gl/tstLQ;谷歌图书“谷歌图书历史”,http://goo.gl/ueobb。
Stanford Digital Library Technologies Project. See “The Stanford Digital Library Technologies Project,” Stanford University, http://goo.gl/tstLQ; “Google Books History,” Google Books, http://goo.gl/ueobb.
谷歌图书的规模。部分出于上述原因,部分是因为书籍作为实体的定义较为模糊,因此计算实体图书馆的藏书数量也颇具挑战性。因此,各图书馆的藏书数量均来自2013年7月18日维基百科上各图书馆的页面。请注意,这些数据并非均为最新数据。另请注意,斯坦福大学已开始关闭实体图书馆,代之以“无书图书馆”。请参阅Lisa M. Krieger的《斯坦福大学为‘无书图书馆’做准备》,《圣何塞水星报》 ,2010年5月18日,在线网址:http://goo.gl/yauezp。
Size of Google Books. Partly for the reasons pointed out above, and partly because the definition of a book, as a physical object, is ambiguous, counting the number of books in a physical library also can be tricky. As such, the number of books in each library was obtained from the library’s page on Wikipedia on July 18, 2013. Note that these numbers are not equally up-to-date. Also note that Stanford is already beginning to close physical libraries and replace them with “bookless libraries.” See Lisa M. Krieger, “Stanford University Prepares for the ‘Bookless Library,’” San Jose Mercury News, May 18, 2010, online at http://goo.gl/yauezp.
长数据
Long Data
我们查阅的书籍。例如,参见路易斯·F·克利普斯坦 (Louis F. Klipstein) 的《盎格鲁-撒克逊语言语法》(Grammar of the Anglo-Saxon Language )(纽约:乔治·P·普特南出版社,1848 年)的电子版,在线访问:http://goo.gl/cWRlJ。需要注意的是,出于法律和伦理方面的考虑,哈佛大学最终退出了谷歌图书计划,只允许谷歌对已过版权保护的作品进行数字化。参见劳拉·G·米尔维斯 (Laura G. Mirviss) 的《哈佛-谷歌在线图书交易面临风险》,《哈佛深红报》,2008 年 10 月 30 日,在线访问:http://goo.gl/0tYflD。
The books we checked. For instance, see the digitized edition of Louis F. Klipstein, Grammar of the Anglo-Saxon Language (New York: George P. Putnam, 1848), online at http://goo.gl/cWRlJ. Note that, in light of legal and ethical concerns, Harvard ended up opting out of the Google Books program, allowing Google to digitize only out-of-copyright works. See Laura G. Mirviss, “Harvard-Google Online Book Deal at Risk,” Harvard Crimson, October 30, 2008, online at http://goo.gl/0tYflD.
长数据。该术语最近由社交网络研究员 Samuel Arbesman 提出。请参阅 Samuel Arbesman 的文章“停止炒作大数据,开始关注‘长数据’”,《连线》,2013 年 1 月 29 日,http://goo.gl/X7oEC。
Long data. This term was recently coined by social network researcher Samuel Arbesman. See Samuel Arbesman, “Stop Hyping Big Data and Start Paying Attention to ‘Long Data,’” Wired, January 29, 2013, http://goo.gl/X7oEC.
数据越多,问题越多
Mo’ Data, Mo’ Problems
数据共享问题。尽管最佳实证数据集尚未广泛普及,但社交网络仍然是一个内容丰富的研究领域。例如,请参阅 Duncan J. Watts 和 Steven H. Strogatz 的《“小世界”网络的集体动力学》,《自然》杂志 393 卷,第 6684 期(1998 年):第 440–42 页,在线访问 http://goo.gl/be3Xmi;Albert-László Barabási 和 Réka Albert 的《随机网络中标度的涌现》,《科学》杂志 286 卷,第 5439 期(1999 年):第 509–12 页,在线访问 http://goo.gl/eESUa8;Ron Milo 等人的《网络基序:复杂网络的简单构建模块》,《科学》杂志 298 卷,第 5594 期(2002 年):第 824–27 页,在线访问 http://goo.gl/duzS5L。
Problems sharing data. Despite the fact that the best empirical datasets are not broadly available, social networks remain a rich area for research. See, for instance, Duncan J. Watts and Steven H. Strogatz, “Collective Dynamics of ‘Small-World’ Networks,” Nature 393, no. 6684 (1998): 440–42, online at http://goo.gl/be3Xmi; Albert-László Barabási and Réka Albert, “Emergence of Scaling in Random Networks,” Science 286, no. 5439 (1999): 509–12, online at http://goo.gl/eESUa8; Ron Milo et al., “Network Motifs: Simple Building Blocks of Complex Networks,” Science 298, no. 5594 (2002): 824–27, online at http://goo.gl/duzS5L.
律师。注意,有时候律师也可能是个好兆头。我们当中有人嫁给了一位律师。
Lawyers. Note that sometimes, lawyers can be a good omen. One of us is married to a lawyer.
文化组学
Culturomics
文化组学启动。我们最初发布了四份资源来总结我们的研究成果:一篇科学论文、一份详细的方法论补充材料和两个补充网站。参见 Jean-Baptiste Michel 等人的《利用数百万本数字化图书进行文化定量分析》,《科学》第 331 卷,第 6014 期(2011 年 1 月 14 日),在线访问 http://goo.gl/mahoN;详尽的补充文本,在线访问 http://goo.gl/1e509;“Ngram Viewer”,谷歌图书,2010,http://books.google.com/ngrams;“文化组学”,文化观察站,http://www.culturomics.org。由于我们将在笔记中频繁引用Michel等人的文章,因此我们将参考文献缩写为Michel2011。我们将使用Michel2011S来引用本文的补充文本。
Launch of culturomics. We initially released four resources summarizing our findings: a scientific paper, a detailed methodological supplement, and two supplemental Web sites. See Jean-Baptiste Michel et al., “Quantitative Analysis of Culture Using Millions of Digitized Books,” Science 331, no. 6014 (January 14, 2011), online at http://goo.gl/mahoN; extensive supplemental text, online at http://goo.gl/1e509; “Ngram Viewer,” Google Books, 2010, http://books.google.com/ngrams; “Culturomics,” Cultural Observatory, http://www.culturomics.org. Because we will refer to Michel et al. frequently in these notes, we will abbreviate the reference as Michel2011. We will use Michel2011S to refer to the paper’s supplemental text.
我们的新视野。请参阅上文“Ngram Viewer”;Erez Lieberman Aiden 和 Jean-Baptiste Michel,“文化组学、Ngrams 和新的科学工具”,谷歌研究博客,2011 年 8 月 10 日,http://goo.gl/FSbbP;Jon Orwant,“Ngram Viewer 2.0”,谷歌研究博客,2012 年 10 月 18 日,http://goo.gl/zOSfg。
Our new scope. See “Ngram Viewer,” above; Erez Lieberman Aiden and Jean-Baptiste Michel, “Culturomics, Ngrams and New Power Tools for Science,” Google Research Blog, August 10, 2011, http://goo.gl/FSbbP; Jon Orwant, “Ngram Viewer 2.0,” Google Research Blog, October 18, 2012, http://goo.gl/zOSfg.
一张图片值多少字?
How many words is a picture worth?
布里斯班对营销人员的演讲。 1911年,他在美国纽约州锡拉丘兹广告俱乐部发表演讲的节选刊登在美国广告业第一份行业刊物《印刷油墨》(Printers' Ink)上。这些节选包含了该表达的最早记录形式:“用一张图片,胜过千言万语。” 更简洁的形式“一张图片胜过千言万语”随后出现,同时出现的还有“一万”和“百万”的变体;最初,这三个版本通常都被认为是布里斯班说的。他很可能在不同的语境中说过这三个版本。参见《印刷油墨》 75卷,第1期(1911年4月6日):17。到1925年,这句话被直接归于孔子。
Brisbane’s speech to marketers. In 1911, extracts from his talk to the Syracuse, New York, Advertising Club appeared in Printers’ Ink, the first American trade publication for the advertising industry. These extracts contain the earliest recorded form of the expression: “Use a picture. It’s worth a thousand words.” The more compact form, “A picture is worth a thousand words,” appears shortly thereafter, as do the “ten thousand” and “million” variants; initially, all three versions are typically attributed to Brisbane. It’s quite possible that he said all three in different contexts. See Printers’ Ink 75, no. 1 (April 6, 1911): 17. By 1925, the phrase was being attributed directly to Confucius.
另请参阅《管理会计》,全国成本会计师协会(1925 年)。
See also Management Accounting, National Association of Cost Accountants (1925).
第二章 G. K. Zipf 与化石猎人
CHAPTER 2. G. K. ZIPF AND THE FOSSIL HUNTERS
简介
Intro
“美丽,美丽,美丽,美丽。”参见卡伦·雷默(Karen Reimer)的《传奇的、词汇的、饶舌的爱》(芝加哥:Sara Ranchouse,1996 年)。
“beautiful beautiful beautiful beautiful beautiful.” See Karen Reimer, Legendary, Lexical, Loquacious Love (Chicago: Sara Ranchouse, 1996).
凯伦·雷默。更准确地说,这本书的封面上写着“凯伦·雷默以伊芙·莱默的笔名撰写了这本书”。更多关于凯伦·雷默作品的信息,请访问http://www.karenreimer.info。
Karen Reimer. More precisely, the book’s cover attributes the book to “Karen Reimer writing as Eve Rhymer.” For more info about Karen Reimer’s work, see http://www.karenreimer.info.
问题儿童
Problem Child
大数据。大数据趋势出现得有点太晚,在书籍中还很难看出端倪;参见第六章关于书籍记录时间分辨率的讨论。其他大数据足以说明问题。根据谷歌趋势, 2011 年之前,谷歌上大数据的搜索量相对持平,之后开始激增。维基百科上关于“大数据”的文章创建于 2010 年 4 月;截至 2013 年 7 月 14 日,该文章已被编辑 694 次,每月浏览量超过 15 万次,是英文维基百科上第 2022 受欢迎的文章。参见:“大数据”,谷歌趋势,2013 年,http://goo.gl/tL8GnD;“大数据”,维基百科,2013 年 7 月 14 日,http://goo.gl/DFFbr; “大数据:修订历史”,维基百科,2013 年 7 月 14 日,http://goo.gl/Jvla3;“大数据”,X! 的编辑计数器,2013 年 7 月 14 日,http://goo.gl/e9YZ7v;“大数据”,维基百科文章流量统计,2013 年 7 月 14 日,http://goo.gl/vgYxH。
Big data. The big data trend is a bit too recent to be easily seen in books; see our discussion of the time resolution of the book record in chapter 6. Other big data will have to suffice. According to Google Trends, the search volume for big data at Google was relatively flat until 2011, and then began to surge. The Wikipedia article on “Big Data” was created in April 2010; as of July 14, 2013, it has been edited 694 times, is viewed more than 150,000 times a month, and is the 2,022nd most popular article on the English Wikipedia. See: “Big data,” Google Trends, 2013, http://goo.gl/tL8GnD; “Big Data,” Wikipedia, July 14, 2013, http://goo.gl/DFFbr; “Big Data: Revision History,” Wikipedia, July 14, 2013, http://goo.gl/Jvla3; “Big Data,” X!’s Edit Counter, July 14, 2013, http://goo.gl/e9YZ7v; “Big Data,” Wikipedia Article Traffic Statistics, July 14, 2013, http://goo.gl/vgYxH.
进化动力学计划。没有比阅读诺瓦克关于该主题的著作更好的方法来了解这个地方、这项研究以及负责人了。参见马丁·A·诺瓦克与罗杰·海菲尔德合著的《超级合作者》(纽约:自由出版社,2011年)。
Program for Evolutionary Dynamics. There’s no better way to get a sense of the place, the research, and the man in charge than through Nowak’s book on the topic. See Martin A. Nowak with Roger Highfield, SuperCooperators (New York: Free Press, 2011).
夜晚太阳会去往何处?这个问题的答案在伽利略·伽利莱1632年出版的一部颇具争议的著作中有所探讨。参见他的《关于托勒密与哥白尼两大世界体系的对话》(由斯蒂尔曼·德雷克译,纽约:现代图书馆,2001年)。
Where the sun goes at night. The answer is discussed in a controversial work, originally published in 1632, by Galileo Galilei. See his Dialogue Concerning the Two Chief World Systems, Ptolemaic and Copernican, trans. Stillman Drake (New York: Modern Library, 2001).
天空为何呈现蓝色?这一现象源于瑞利散射,由瑞利勋爵发现。当时,他的名字是约翰·斯特拉特。参见约翰·斯特拉特,《论天空中的光、其偏振和色彩》,《哲学杂志》第41卷,第4辑(1871年):107-120页,274-279页。
Why the sky is blue. The effect is due to Rayleigh scattering, discovered by Lord Rayleigh. At the time, his name was John Strutt. See John Strutt, “On the Light from the Sky, Its Polarization and Colour,” Philosophical Magazine 41, series 4 (1871): 107–20, 274–79.
一棵树能否长得像山一样高。参见George W. Koch等人,《树高的限制》,《自然》杂志428期(2004年4月22日):851-854,在线访问:http://goo.gl/lxNlq。
Whether a tree could grow as tall as a mountain. See George W. Koch et al., “The Limits to Tree Height,” Nature 428 (April 22, 2004): 851–54, online at http://goo.gl/lxNlq.
为什么你必须睡觉。参见卡洛斯·申克的《睡眠》(纽约:企鹅出版社,2007年)。尽管关于睡眠的书籍不胜枚举,但没有人真正了解我们为什么需要睡眠。这对理论家来说是一个有趣的领域。例如,参见范·M·萨维奇和杰弗里·B·韦斯特的《理解哺乳动物睡眠的定量理论框架》,《美国国家科学院院刊》(2006年11月20日),在线访问:http://goo.gl/wFWDC。
Why you have to go to sleep. See Carlos Schenck, Sleep (New York: Penguin, 2007). Despite the existence of numerous books on the subject, nobody really knows why we need to sleep. It’s a fun area for theorists. See, for instance, Van M. Savage and Geoffrey B. West, “A Quantitative, Theoretical Framework for Understanding Mammalian Sleep,” PNAS: Proceedings of the National Academy of Sciences (November 20, 2006), online at http://goo.gl/wFWDC.
恐龙猎人
Dinosaur Hunters
人类学是一门科学。参见尼古拉斯·韦德,《人类学是一门科学吗?声明加深了裂痕》,《纽约时报》 ,2010年12月9日,在线版,网址:http://goo.gl/eCI9K3。
Anthropology as science. See Nicholas Wade, “Anthropology a Science? Statement Deepens a Rift,” New York Times, December 9, 2010, online at http://goo.gl/eCI9K3.
内森·米尔沃德(Nathan Myhrvold)。参见内森·米尔沃德、克里斯·杨和马克辛·比莱特合著《现代主义烹饪:烹饪的艺术与科学》(华盛顿州贝尔维尤:烹饪实验室出版社,2011年);马尔科姆·格拉德威尔,《在空中》,《纽约客》 ,2008年5月12日,在线版,网址:http://goo.gl/TTtsLU。
Nathan Myhrvold. See Nathan Myhrvold, Chris Young, and Maxine Bilet, Modernist Cuisine: The Art and Science of Cooking (Bellevue, WA: The Cooking Lab, 2011); Malcolm Gladwell, “In the Air,” New Yorker, May 12, 2008, online at http://goo.gl/TTtsLU.
1937年:数据奥德赛
1937: A Data Odyssey
the 的频率。2000年英文书籍中出现的频率:每百个单词 4.6 个。
Frequency of the. Frequency in English books in 2000: 4.6 per hundred words.
quiescence 的频率。2000年英文书籍中出现的频率:每五百万词中出现两个。
Frequency of quiescence. Frequency in English books in 2000: two in every five million words.
今天统计单词数量。以下 Linux 命令会生成一个文本文件中所有 1-gram 的列表,按出现频率从高到低排序:
Counting words today. The following Linux command produces a list of all 1-grams in a text file, ordered from most frequent to least frequent:
cat textfile.txt | tr ' ' '\n' | sort | uniq -c | sort -k1 -n -r > 1grams.txt
cat textfile.txt | tr ' ' '\n' | sort | uniq -c | sort -k1 -n -r > 1grams.txt
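For readers who would rather not use a shell pipeline, the same count can be sketched in Python. This is only an illustration: the function name count_1grams and the sample sentence are made up for this note. It also differs slightly from the command above, because it splits on any run of whitespace, so repeated spaces do not produce empty entries.

```python
from collections import Counter


def count_1grams(text):
    """Return (word, count) pairs ordered from most to least frequent."""
    # str.split() with no argument splits on any run of whitespace
    # and never yields empty strings.
    counts = Counter(text.split())
    # most_common() sorts by descending count.
    return counts.most_common()


# Hypothetical sample text, standing in for the contents of textfile.txt.
sample = "the cat and the dog and the bird"
for word, n in count_1grams(sample):
    print(n, word)
```

To run it on a real file, read the file's contents into a string and pass that string to count_1grams in place of the sample.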
人类计算机。其中许多是女性。她们非凡的故事在大卫·艾伦·格里尔的《当计算机是人类时》(新泽西州普林斯顿:普林斯顿大学出版社,2007年)一书中有所记载。亚马逊的Mechanical Turk服务,标榜为“人工智能”,在某种程度上反映了基于网络、众包模式的回归。参见http://www.mturk.com。
Human computers. Many of these human computers were women. Their remarkable story is told in David Alan Grier, When Computers Were Human (Princeton, NJ: Princeton University Press, 2007). Amazon’s Mechanical Turk service, billed as “artificial artificial intelligence,” reflects in some ways a Web-based, crowdsourced return to this sort of approach. See http://www.mturk.com.
迈尔斯·L·汉利(Miles L. Hanley)。参见迈尔斯·汉利,《詹姆斯·乔伊斯《尤利西斯》词汇索引》(麦迪逊:威斯康星大学出版社,1937年)。
Miles L. Hanley. See Miles Hanley, Word Index to James Joyce’s Ulysses (Madison: University of Wisconsin Press, 1937).
汉利词汇索引在齐普夫著作中的作用。齐普夫第一次接触以他的名字命名的定律是在他对《尤利西斯》中词频的研究之前。1911 年,一位名叫 RC 埃尔德里奇的商人发表了一份使用八页报纸文本计算出的词频列表。埃尔德里奇注意到“经过精心挑选的适量词汇能够让任何两个人理解它们……从而就许多话题进行明智的交谈”,他的目标是使用词汇统计数据来勾勒出“通用词汇的基础”。得出的频率是齐普夫在其 1935 年出版的《语言的心理生物学》一书中计算的基础,该书是齐普夫关于现在称为齐普夫定律的规律的第一部出版物。请参阅 George Kingsley Zipf 的《语言的心理生物学》(波士顿:霍顿·米夫林,1935 年),在线网址为 http://goo.gl/KYvOcK;乔治·金斯利·齐普夫(George Kingsley Zipf),《人类行为与最省力原则》(马萨诸塞州雷丁:Addison-Wesley,1949 年);RC·埃尔德里奇(RC Eldridge),《六千个常用英语单词》(纽约州布法罗:Clement Press,1911 年)。
Role of Hanley’s word index in Zipf’s work. Zipf’s first encounter with the law that bears his name precedes his examination of word frequency in Ulysses. In 1911, a businessman named R. C. Eldridge published a list of word frequencies calculated using eight pages of newspaper text. Having noticed “that a moderate number of words, wisely selected, would enable any two people understanding them . . . to converse intelligently on many subjects,” Eldridge’s goal was to use lexical statistics to outline “the foundations of a universal vocabulary.” The resulting frequencies were the basis of Zipf’s calculations in his 1935 book Psycho-Biology of Language, which is the first of Zipf’s publications on the regularity now known as Zipf’s law. See George Kingsley Zipf, The Psycho-Biology of Language (Boston: Houghton Mifflin, 1935), online at http://goo.gl/KYvOcK; George Kingsley Zipf, Human Behavior and the Principle of Least Effort (Reading, MA: Addison-Wesley, 1949); R. C. Eldridge, Six Thousand Common English Words (Buffalo, NY: Clement Press, 1911).
按频率对《尤利西斯》中的单词进行排序。齐普夫在很大程度上依赖于马丁·朱斯(Martin Joos)编纂的汉利词汇索引附录,其中朱斯列出了大部分必要的统计数据。
Ranking the words in Ulysses by frequency. Zipf was able to rely extensively on an appendix to Hanley’s word index, by Martin Joos, in which Joos tabulated most of the requisite statistics.
齐普夫定律。如果我们不指出齐普夫定律既不是齐普夫定律,也不是一条定律,那就太失职了。它之所以不是一条定律,有几个原因。首先,它只是近似正确;仔细考察就会发现,大多数语言都表现出与纯粹齐普夫行为不同的系统性偏差。其次,尽管存在许多(相互矛盾的)理论推导,但齐普夫定律是否适用于所有语言,或是否只适用于任何一种特定的语言,这一点尚不明确。齐普夫定律最好被认为是一种极其普遍——且颇为神秘——的经验规律。
Zipf’s law. We’d be remiss if we failed to point out that Zipf’s law is neither Zipf’s, nor is it a law. It’s not a law for several reasons. First, it’s only approximately true; on close examination, most languages exhibit systematic deviations away from purely Zipfian behavior. Second, despite many (conflicting) theoretical derivations, it’s not clear that Zipf’s law must hold for all languages, or for any language in particular. Zipf’s law is best thought of as an extremely universal—and rather mysterious—empirical regularity.
这实际上也不是齐普夫的发现,因为齐普夫并非第一个发现它的人。据我们所知,第一个揭示其潜在数学原理的人是法国速记员让-巴蒂斯特·埃斯托普(Jean-Baptiste Estoup),他在其广受欢迎的速记著作的1912年版中开始发表对这一主题的探索;在速记这门学科中,齐普夫式的规律有着直接而实际的影响。爱德华·康登在1928年发表于《科学》杂志的一篇论文中,首次采用双对数坐标轴上的排名-频率图这一经典方式来表示齐普夫定律。康登后来成为一位非常杰出的物理学家,并担任过美国物理学会和美国科学促进会的主席。
It’s also not really Zipf’s, because Zipf was not the first to discover it. As far as we know, the first person to uncover the underlying mathematical principle was a French stenographer named Jean-Baptiste Estoup, who began publishing his explorations on this topic in the 1912 edition of his popular work on shorthand note-taking, a discipline in which Zipfian regularities have immediate and practical consequences. The classic representation of Zipf’s law by means of a rank-frequency plot on double-log axes was first introduced by Edward Condon in a 1928 paper in Science. Condon went on to become a very prominent physicist, serving as president of both the American Physical Society and the American Association for the Advancement of Science.
齐普夫于 1935 年首次发表关于齐普夫定律的文章。他似乎独立地重新发现了其他人的许多发现,并用质量高得多的数据证实了这些发现。(对齐普夫学术渊源的批判性考察虽然引人入胜,但超出了本文的范围。)齐普夫继续在这个主题上工作了很多年,将基本结果置于理论框架和对整个社会科学中类似现象的广泛考察的背景下。齐普夫也是这些思想最具影响力的整合者和推广者。一篇书评称他 1949 年出版的《人类行为与最省力原则》一书是“有史以来最雄心勃勃的书籍之一……截然不同、令人耳目一新。它跨越了院系和学部的界限,一个世纪以来没有任何其他著作能做到这一点。”参见 John Q. Stewart 对乔治·金斯利·齐普夫所著《人类行为与最省力原则》的书评,《科学》(Science)第 110 卷,第 2868 期(1949 年 12 月 16 日):669。为了简洁起见,我们在正文中的讨论大致基于该书中给出的处理。
Zipf’s first publication on Zipf’s law appeared in 1935. He appears to have independently rediscovered many of the findings of the others, and confirmed them using much better data. (A critical examination of Zipf’s intellectual debts, although fascinating, is beyond the scope of this text.) Zipf continued to work on the subject for many years, setting the basic results in the context of both a theoretical framework and a broad examination of similar phenomena throughout the social sciences. Zipf also served as the single most influential synthesizer and popularizer of these ideas. A review of his 1949 book Human Behavior and the Principle of Least Effort called it “one of the most ambitious books ever written . . . altogether different and refreshing. It cuts across departmental and divisional boundaries as nothing else has for a century.” See John Q. Stewart, review of Human Behavior and the Principle of Least Effort, by George Kingsley Zipf, Science 110, no. 2868 (December 16, 1949): 669. For the sake of conciseness, our discussion in the main text is loosely based on the treatment given in this book.
不过,考虑到这个概念的更完整历史,齐普夫定律有没有更准确的名称呢?齐普夫定律应该被称为埃斯托普-康登-齐普夫规律,这种说法相当合理。即便如此,也并不完全公平。齐普夫的工作得益于汉利、朱斯和埃尔德里奇进行的词汇索引和计数。康登的工作同样基于其他人进行的频率分析——就他而言,是伦纳德·艾尔斯和戈弗雷·杜威(梅尔维尔·杜威的儿子,杜威十进制分类法的发明者)。所以,我们实际上应该称齐普夫定律为埃斯托普-康登-齐普夫-埃尔德里奇-艾尔斯-杜威-汉利-朱斯规律。这可能就是我们坚持使用齐普夫定律的原因。
Still, given the fuller history of this concept, is there a more accurate name for Zipf’s law? It’s pretty reasonable to argue that Zipf’s law really ought to be called the Estoup-Condon-Zipf regularity. Even that’s not totally fair. Zipf’s work was made possible by the word indexing and counting that had been performed by Hanley, Joos, and Eldridge. Condon’s work, too, was based on frequency analyses performed by others—in his case, Leonard Ayres and Godfrey Dewey (son of Melvil Dewey, who invented the Dewey decimal system). So really we should call Zipf’s law the Estoup-Condon-Zipf-Eldridge-Ayres-Dewey-Hanley-Joos regularity. This is probably why we just stick to Zipf’s law.
无论如何,几乎可以肯定的是,所有基于对真正令人印象深刻的数据集进行艰苦分析而得出的发现,都不会以生成基础数据的人的名字来命名。既然我们忙着命名,不妨也发个安慰奖。我们姑且称之为汉利原则。
Anyway, it’s almost a truism that every finding based on painstaking analysis of a really impressive dataset is not named after the person who generated the underlying data. While we’re busy naming things, we might as well hand out consolation prizes. Call this one the Hanley principle.
请参阅 Jean-Baptiste Estoup 的《速记游戏》(巴黎:速记研究所,1916 年);EU Condon 的《词汇统计》,《科学》第 67 卷,第 1733 期(1928 年 3 月 16 日):300,在线网址为 http://goo.gl/Qi5B49;Leonard P. Ayres 的《拼写能力测量量表》(纽约:罗素·塞奇基金会,1915 年),在线网址为 http://goo.gl/C0cgke;Godfrey Dewey 的《英语语音相对频率》(马萨诸塞州剑桥:哈佛大学出版社,1923 年); M. Petruszewycz,“L'Histoire de la Loi d'Estoup-Zipf:文献”,Mathématiques et Sciences Humaines 44 (1973):41-56,在线网址:http://goo.gl/LlrNn。
See Jean-Baptiste Estoup, Gammes Sténographiques (Paris: Institut Sténographique, 1916); E. U. Condon, “Statistics of Vocabulary,” Science 67, no. 1733 (March 16, 1928): 300, online at http://goo.gl/Qi5B49; Leonard P. Ayres, A Measuring Scale for Ability in Spelling (New York: Russell Sage Foundation, 1915), online at http://goo.gl/C0cgke; Godfrey Dewey, Relative Frequency of English Speech Sounds (Cambridge, MA: Harvard University Press, 1923); M. Petruszewycz, “L’Histoire de la Loi d’Estoup-Zipf: Documents,” Mathématiques et Sciences Humaines 44 (1973): 41–56, online at http://goo.gl/LlrNn.
Willem Levelt 的《心理语言学史》(牛津:牛津大学出版社,2012 年)一书对这些思想进行了简明扼要的回顾。Nelson HF Beebe 的《本福德定律、希普斯定律和齐普夫定律出版物书目》(盐湖城:犹他大学出版社,2013 年)提供了关于齐普夫定律及其相关原理的详尽参考书目,该书在线版本可访问 http://goo.gl/TuyT0。一个相关概念是“1/f 噪声”。参见 Benoit B. Mandelbrot 的《多重分形与 1/f 噪声:物理学中的狂野自亲和性》(纽约:Springer,1999 年)。
A brief and elegant review of these ideas appears in Willem Levelt, A History of Psycholinguistics (Oxford: Oxford University Press, 2012). A very extensive bibliography on Zipf’s law and related principles is given in Nelson H. F. Beebe, A Bibliography of Publications about Benford’s Law, Heaps’ Law, and Zipf’s Law (Salt Lake City: University of Utah, 2013), online at http://goo.gl/TuyT0. A related concept is the notion of “1/f noise.” See Benoit B. Mandelbrot, Multifractals and 1/f Noise: Wild Self-Affinity in Physics (New York: Springer, 1999).
齐普夫眼中的世界
The World According to Zipf
人类身高分布。参见CD Fryar、Q. Gu和CL Ogden合著的《儿童和成人人体测量参考数据:美国,2007-2010年》,《生命健康统计》第11卷,第252期(2012年),网址:http://goo.gl/uEuiV。
Distribution of human height. See C. D. Fryar, Q. Gu, and C. L. Ogden, “Anthropometric Reference Data for Children and Adults: United States, 2007–2010,” Vital Health Statistics 11, no. 252 (2012), online at http://goo.gl/uEuiV.
幂律。更准确地说,当一个量与另一个量的固定次幂(即固定指数)成正比时,就称二者之间存在幂律关系。齐普夫定律就是一种幂律,其中两个量分别是排名和丰度,指数的绝对值为1:丰度与排名成反比。如果这两个量涉及某个网络,则该网络通常被称为“无标度网络”。参见Steven H. Strogatz,《探索复杂网络》,《自然》 410,第6825期(2001年):268-76,在线访问http://goo.gl/gO6Eb4。如果这两个量涉及某种几何结构,且指数不是整数,则该结构有一个专门的术语:分形。参见Benoit Mandelbrot,《自然的分形几何》(旧金山:W. H. Freeman出版社,1985年)。
Power laws. More precisely, something is said to be a power law when one quantity is proportional to another quantity raised to a fixed exponent, or power. Zipf’s law is a power law in which the two quantities are rank and abundance, and the exponent has magnitude one: abundance is inversely proportional to rank. If the quantities pertain to a network, the underlying network is in general known as “scale-free.” See Steven H. Strogatz, “Exploring Complex Networks,” Nature 410, no. 6825 (2001): 268–76, online at http://goo.gl/gO6Eb4. When the two quantities pertain to a geometric structure, and the exponent is not an integer, there is a special word for the underlying structure: a fractal. See Benoit Mandelbrot, The Fractal Geometry of Nature (San Francisco: W. H. Freeman, 1985).
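One way to see what the definition means in practice: on double-log axes a power law becomes a straight line whose slope is the exponent. The sketch below is only illustrative. It builds an idealized Zipfian list of counts, in which the word of rank r occurs C / r times, and fits a least-squares line to log(count) against log(rank). The constant C, the list length, and the function name loglog_slope are assumptions made for this example; they are not taken from the ngram data.

```python
import math


def loglog_slope(values):
    """Least-squares slope of log(value) against log(rank), ranks starting at 1."""
    xs = [math.log(rank) for rank in range(1, len(values) + 1)]
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Idealized Zipfian counts: the word of rank r occurs C / r times.
# C is an arbitrary, made-up constant.
C = 60000.0
ideal_counts = [C / r for r in range(1, 1001)]

slope = loglog_slope(ideal_counts)
print(round(slope, 6))
```

For these idealized counts the fitted slope is minus one up to rounding error: the exponent has magnitude one, and the sign reflects that abundance falls as rank rises. Real word counts would deviate from this line, as the note on Zipf's law points out.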
虽然齐普夫是最早发现词频幂律的人之一,但更早的研究人员已经在完全不同的学科中发现了其他幂律。最著名的是经济学家维尔弗雷多·帕累托观察到,意大利80%的土地归20%的人所有。这是众多类似80/20规则中的第一个。从数学上讲,这种偏差与幂律密切相关。
Although Zipf was among the first to identify a power law in word frequencies, earlier researchers had discovered other power laws in entirely different disciplines. Most notably, the economist Vilfredo Pareto observed that 80 percent of the land in Italy was owned by 20 percent of the people. This was the first of many such 80/20 rules. This sort of skew is, mathematically speaking, closely associated with power laws.
许多幂律关系最早由齐普夫在其1949年的著作《人类行为与最省力原则》中报告,他在书中还汇集了其他人的许多发现。更多近期综述,请参阅Aaron Clauset、Cosma Rohilla Shalizi和M. E. J. Newman,《经验数据中的幂律分布》,《SIAM评论》 51卷,第4期(2009年):661-703页,在线网址:http://goo.gl/6PLJFF;Manfred Schroeder,《分形、混沌、幂律:来自无限天堂的瞬间》(纽约:W. H. Freeman出版社,1991年)。此类关系无处不在,以至于在看似狭窄的领域中也能找到大量的例子。例如,请参阅Ignacio Rodríguez-Iturbe和Andrea Rinaldo合著的《分形河流盆地:机遇与自组织》(英国剑桥:剑桥大学出版社,2001年)。
Many power-law relationships were first reported by Zipf in his 1949 book, Human Behavior and the Principle of Least Effort, where he also collects many findings by others. For more recent surveys, see Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman, “Power-Law Distributions in Empirical Data,” SIAM Review 51, no. 4 (2009): 661–703, online at http://goo.gl/6PLJFF; Manfred Schroeder, Fractals, Chaos, Power Laws: Minutes from an Infinite Paradise (New York: W. H. Freeman, 1991). Such relations are so ubiquitous that there can be a vast array of examples in seemingly narrow fields. See, for instance, Ignacio Rodríguez-Iturbe and Andrea Rinaldo, Fractal River Basins: Chance and Self-Organization (Cambridge, England: Cambridge University Press, 2001).
比尔·盖茨与月球。根据2010年的人口普查数据,美国家庭净资产中位数(不包括房屋净值)为1.5万美元。2010年3月,《福布斯》杂志估计比尔·盖茨的净资产为530亿美元。五英尺七英寸约合1.7米。因此,在我们假设的场景中,盖茨的身高约为6007公里。这远高于冥王星(直径2390公里)、水星(直径4879公里)和月球(直径3474公里),几乎与火星(直径6792公里)一样高。即使算上房屋净值,使美国家庭净资产中位数达到66740美元,盖茨的身高仍将达到1350公里,超过冥王星高度的一半。请参阅《世界亿万富翁:威廉·盖茨三世》,《福布斯》 ,2010 年 3 月 10 日,http://goo.gl/8ykj;《财富与资产所有权》,美国人口普查局,2013 年 7 月 11 日,http://goo.gl/llnbC,尤其是《2010 年财富表》,美国人口普查局,http://goo.gl/v7mxk。
Bill Gates vs. the moon. According to the 2010 census, the median net worth of American households, excluding home equity, was $15,000. In March 2010, Forbes estimated Bill Gates’ net worth at $53 billion. Five-foot-seven is 1.7 meters. Thus in our hypothetical scenario, Gates would be about 6,007 kilometers tall. This is far taller than Pluto (diameter 2,390 kilometers), Mercury (diameter 4,879 kilometers), and the moon (diameter 3,474 kilometers); it’s nearly as tall as Mars (diameter 6,792 kilometers). Even if home equity is included, bringing the median household net worth up to $66,740, he would still be 1,350 kilometers tall, more than half the height of Pluto. See “The World’s Billionaires: William Gates III,” Forbes, March 10, 2010, http://goo.gl/8ykj; “Wealth and Asset Ownership,” U.S. Census Bureau, July 11, 2013, http://goo.gl/llnbC and in particular “Wealth Tables 2010,” U.S. Census Bureau, http://goo.gl/v7mxk.
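The arithmetic behind these heights can be checked in a few lines. The sketch assumes, as the figures in this note imply, that height is scaled in direct proportion to net worth, with the median household standing 1.7 meters tall. The function name scaled_height_km is invented for the example.

```python
def scaled_height_km(net_worth, median_net_worth, median_height_m=1.7):
    """Height in kilometers if height were proportional to net worth.

    Assumes a household with the median net worth is median_height_m tall.
    """
    return net_worth / median_net_worth * median_height_m / 1000.0


gates_net_worth = 53e9  # Forbes estimate, March 2010

# Median household net worth excluding home equity: $15,000.
print(round(scaled_height_km(gates_net_worth, 15_000)))

# Median household net worth including home equity: $66,740.
print(round(scaled_height_km(gates_net_worth, 66_740)))
```

The two results round to 6,007 and 1,350 kilometers, matching the figures quoted above.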
齐普夫定律背后的原因。参见MEJ Newman,《幂律、帕累托分布和齐普夫定律》,《当代物理学》第46卷,第5期(2005年),在线访问:http://goo.gl/nrkMB。关于随机猴子的解释,参见George A. Miller,《间歇性沉默的一些影响》,《美国心理学杂志》第70卷,第2期(1957年6月):第311-14页,在线访问:http://goo.gl/p6PLll。
Reasons behind Zipf’s law. See M. E. J. Newman, “Power Laws, Pareto Distributions and Zipf’s Law,” Contemporary Physics 46, issue 5 (2005), online at http://goo.gl/nrkMB. The random monkeys explanation appears in George A. Miller, “Some Effects of Intermittent Silence,” American Journal of Psychology 70, no. 2 (June 1957): 311–14, online at http://goo.gl/p6PLll.
太 Zipf 或不太 Zipf
Too Zipf or Not Too Zipf
不规则动词。有关这个引人入胜的话题的详尽介绍,请参阅史蒂芬·平克的《词汇与规则:语言的成分》(纽约:基础图书出版社,1999年)。根据你的观点,不规则动词要么奇特,要么令人愉悦地古怪。一位女士曾在《纽约书评》上刊登征婚广告,开头是:“你是一个不规则动词吗?” 请参阅史蒂芬·平克的《语言本能》(纽约:威廉·莫罗出版社,1994年),第134页。
The irregular verbs. For a rich and detailed introduction to this fascinating topic, see Steven Pinker, Words and Rules: The Ingredients of Language (New York: Basic Books, 1999). Depending on your perspective, irregular verbs are either strange or delightfully quirky. A woman once ran a personals ad in the New York Review of Books that began: “Are you an irregular verb?” See Steven Pinker, The Language Instinct (New York: William Morrow, 1994), 134.
少数人、骄傲的人、强者
The Few, the Proud, the Strong
学习不规则动词。孩子们掌握不规则动词的方式特别有趣,他们会经历一些与他们日益复杂的思维相对应的阶段。起初,他们会用自己独特的方式运用所有动词。然后,他们开始识别周围语言中固有的规则。当他们意识到大多数动词都遵循-ed规则时,他们就会进入一个称为超规则化的阶段,在这个阶段,他们会把所有动词都视为规则动词,并说出诸如goed、 knowed和runned之类的动词。最终,他们会意识到某些动词是-ed规则的例外,并逐渐开始在语音中融入正确的不规则形式。
Learning irregular verbs. Children master irregular verbs in a particularly fascinating way, going through characteristic stages that correspond to their increasingly sophisticated minds. At first, they conjugate all verbs idiosyncratically. Then they begin to recognize the rules inherent in the language spoken around them. When they realize that most verbs obey the -ed rule, they pass into a stage called hyperregularization, in which they treat every verb as regular and say things like goed and knowed and runned. Eventually, they realize that certain verbs are exceptions to the -ed rule and gradually begin to incorporate the correct irregular forms into their speech.
原始印欧语与元音变换。参见JP Mallory和DQ Adams著《原始印欧语和原始印欧语世界牛津导论》(牛津:牛津大学出版社,2006年);Don Ringe著《英语语言史》(牛津:牛津大学出版社,2006年)。
Proto-Indo-European and the ablaut. See J. P. Mallory and D. Q. Adams, The Oxford Introduction to Proto-Indo-European and the Proto-Indo-European World (Oxford: Oxford University Press, 2006); Don Ringe, A Linguistic History of English (Oxford: Oxford University Press, 2006).
齿后缀的出现。与强不规则动词不同,规则动词也被称为“弱动词”。参见Detlef Stark,《古英语弱动词》(德国图宾根:M. Niemeyer出版社,1982年);Robert Howren,《古英语弱动词的产生》,《语言》第43卷,第3期(1967年9月),在线访问:http://goo.gl/2yf0t。
The emergence of the dental suffix. Unlike the strong irregulars, regulars are also known as “weak.” See Detlef Stark, The Old English Weak Verbs (Tübingen, Germany: M. Niemeyer, 1982); Robert Howren, “The Generation of Old English Weak Verbs,” Language 43, no. 3 (September 1967), online at http://goo.gl/2yf0t.
不规则化。规则化通常是单行道,但也有极其罕见的例外。其中之一就是不规则形式snuck,它在上个世纪悄然进入英语。继stick/stuck、 strike/struck和stink/stunk等不规则动词之后,每年约有1%的英语使用者将sneaked转换为snuck 。照此速度,当你读到这句话的时候,就会有一个人悄悄溜走。参见Steven Pinker,《不规则动词》,《 Landfall 》 (2000年秋季):83-85,在线访问:http://goo.gl/kFFzLm。
Irregularization. Regularization is usually a one-way street, but there are extremely rare exceptions. One is the irregular form snuck, which sneaked into the English language this past century. Following the lead of irregular verbs like stick/stuck, strike/struck, and stink/stunk, about 1 percent of English speakers are switching from sneaked to snuck each year. At this rate, one person will have snuck off while you read this sentence. See Steven Pinker, “The Irregular Verbs,” Landfall (Autumn 2000): 83–85, online at http://goo.gl/kFFzLm.
2005年:又一次数据之旅
2005: Another Data Odyssey
为什么我们说 drove。实际上,现代英语中并不存在完全不规则的动词。规则形式即使频率很低,也始终存在,静待时机。频率对这种现象有非常大的影响,因为高频的不规则动词能更有效地压制与之竞争的规则形式。与 drove 相比,drived 的信号可以忽略不计。这大概能让 drove 保持安全。相比之下,throve 几个世纪以来一直显得脆弱;规则化形式 thrived 在二十世纪开始占上风,但在此之前很久就已经是一个强大的竞争对手。这是一个非常普遍的效应。在我们的 ngram 数据中,found(频率:2,000 分之一)的出现次数是 finded 的 200,000 倍。但 dwelt(频率:100,000 分之一)在我们数据中的出现次数仅为 dwelled 的 60 倍。请参阅 Michel2011。
Why we say drove. Actually, there’s no such thing as a completely irregular verb in Modern English. Even if its frequency is very low, the regular form always exists, biding its time. Frequency has a very strong effect on this phenomenon, because frequent irregulars do a much better job of suppressing the competing regular form. Compared to drove, the signal for drived is negligible. That probably keeps drove safe. In contrast, throve has been looking vulnerable for centuries; the regularized form thrived started winning in the twentieth century but was already a formidable competitor long before that. This is a very general effect. In our ngram data, we found found (frequency: 1 in 2,000) 200,000 times more often than we finded finded. But dwelt (frequency: 1 in 100,000) dwelt in our data only 60 times as often as dwelled dwelled. See Michel2011.
请注意,为了完成我们2007年的研究,我们偶尔需要一份可视为“权威”的现代英语不规则动词列表。例如,我们用这样的列表来确定哪些动词已经规则化,哪些还没有。自行整理这样的列表可能会让我们的方法容易受到“选择性采纳”的影响,因此我们使用了S. Pinker和A. Prince在《论语言与联结主义:语言习得的并行分布式处理模型分析》(Cognition 28 (1988): 73–193)中提出的列表。我们认为,任何至少有一个词义根据此列表可进行不规则变位的动词都是不规则动词。请注意,词典和其他资料来源之间偶尔会对哪些动词是不规则动词、哪些动词不是不规则动词存在分歧。例如,根据上述列表, wed/wed仍然是不规则动词,但并非所有当代词典都如此。(有些词典已经倾向于使用wed/wedded。)
Note that, for the purpose of our 2007 study, we occasionally needed a list of Modern English irregular verbs that we could regard as “authoritative.” For instance, we used such a list to determine which verbs have regularized and which have not. Curating such a list on our own could leave our method vulnerable to concerns about cherry-picking, so we used the list that appears in S. Pinker and A. Prince, “On Language and Connectionism: Analysis of a Parallel Distributed Processing Model of Language Acquisition,” Cognition 28 (1988): 73–193. We regarded as irregular any verb that has at least one sense which is conjugated as an irregular according to this list. Note that there is occasional disagreement between dictionaries and other sources about which verbs are irregular and which are not. For instance, wed/wed remains irregular according to the above list, but not according to all contemporary dictionaries. (Some already favor wed/wedded.)
现存的语言学假设。琼·L·拜比(Joan L. Bybee)在其著作《形态学:意义与形式关系研究》 (阿姆斯特丹:约翰·本杰明出版社,1985年)中探讨了频率与规则化之间的关系。更广泛地说,关于语言演变如何发生的研究已有很多。例如,参见威廉·拉波夫(William Labov),《传播与扩散》(Transmission and Diffusion),《语言》(Language)第83卷,第2期(2007年6月):第344-387页,在线访问:http://goo.gl/aZ5M2R;格雷维尔·科贝特(Greville Corbett)等,《频率、规则性和范式:从俄语视角看一种复杂关系》,载《频率与语言结构的涌现》(Frequency and the Emergence of Linguistic Structure),琼·L·拜比和保罗·J·霍珀编(阿姆斯特丹:约翰·本杰明出版社,2001年),第201-28页。这些问题也可以从更明确的进化视角来探讨。参见马克·佩格尔,《文化连线:人类社会思维的起源》(纽约:W. W. 诺顿出版社,2012年);马克·佩格尔、昆汀·D·阿特金森和安德鲁·米德,《词语使用频率预测印欧语系历史中词汇演变的速度》,《自然》第449卷(2007年10月11日):717-720页,在线访问:http://goo.gl/93WiJ0。另见帕塔·尼约吉(Partha Niyogi),《语言学习与进化的计算本质》(马萨诸塞州剑桥:麻省理工学院出版社,2009 年);尼约吉是该领域的杰出人物,于 2010 年不幸去世,年仅 43 岁。
Extant linguistic hypotheses. The relationship between frequency and regularization is explored in Joan L. Bybee, Morphology: A Study of the Relation Between Meaning and Form (Amsterdam: John Benjamins, 1985). More generally, there has been a great deal of work on how linguistic change comes about. See, for instance, William Labov, “Transmission and Diffusion,” Language 83, no. 2 (June 2007): 344–87, online at http://goo.gl/aZ5M2R; Greville Corbett et al., “Frequency, Regularity, and the Paradigm: A Perspective from Russian on a Complex Relation,” in Frequency and the Emergence of Linguistic Structure, ed. Joan L. Bybee and Paul J. Hopper (Amsterdam: John Benjamins, 2001), 201–28. These questions can also be explored from a more explicitly evolutionary perspective. See Mark Pagel, Wired for Culture: Origins of the Human Social Mind (New York: W. W. Norton, 2012); Mark Pagel, Quentin D. Atkinson, and Andrew Meade, “Frequency of Word-Use Predicts Rates of Lexical Evolution Throughout Indo-European History,” Nature 449 (October 11, 2007): 717–20, online at http://goo.gl/93WiJ0. See also Partha Niyogi, The Computational Nature of Language Learning and Evolution (Cambridge, MA: MIT Press, 2009); Niyogi, a luminary in the field, passed away, tragically, in 2010. He was forty-three.
教科书。例如,包括奥利弗·法拉·爱默生的《中古英语读本》(纽约:麦克米伦出版社,1909年)和亨利·斯威特的《盎格鲁-撒克逊入门书》(牛津:克拉伦登出版社,1887年)。
Textbooks. These include, for instance, Oliver Farrar Emerson, A Middle English Reader (New York: Macmillan, 1909), and Henry Sweet, An Anglo-Saxon Primer (Oxford: Clarendon Press, 1887).
适者生存
Survival of the Fit
检测自然选择。关于这个主题的文献有很多。例如,参见PC Sabeti等人,“从单倍型结构检测人类基因组中的近期正向选择”,《自然》 419,第6909期(2002年):832-837,在线访问http://goo.gl/TW6SYJ;P. Varilly等人,“全基因组检测和表征人类群体中的正向选择”,《自然》 449,第7164期(2007年):913-918,在线访问http://goo.gl/NfnzeU。
Detecting natural selection. There is a massive literature on this topic. See, e.g., P. C. Sabeti et al., “Detecting Recent Positive Selection in the Human Genome from Haplotype Structure,” Nature 419, no. 6909 (2002): 832–37, online at http://goo.gl/TW6SYJ; P. Varilly et al., “Genome-Wide Detection and Characterization of Positive Selection in Human Populations,” Nature 449, no. 7164 (2007): 913–18, online at http://goo.gl/NfnzeU.
我们对不规则动词的分析。本文最初发表于Erez Lieberman等人的论文《量化语言的进化动态》,《自然》 449卷(2007年10月11日):713-716页,在线访问:http://goo.gl/3kCMQT。
Our analysis of irregular verbs. This work originally appeared as Erez Lieberman et al., “Quantifying the Evolutionary Dynamics of Language,” Nature 449 (October 11, 2007): 713–16, online at http://goo.gl/3kCMQT.
放射性和半衰期。参见“放射性衰变”,维基百科,2013年6月22日,http://goo.gl/xTYh1;“半衰期”,维基百科,2013年6月3日,http://goo.gl/TXn3。
Radioactivity and half-life. See “Radioactive Decay,” Wikipedia, June 22, 2013, http://goo.gl/xTYh1; “Half-life,” Wikipedia, June 3, 2013, http://goo.gl/TXn3.
过去与未来
The Once and Future Past
drove 何时会规则化。像 drove 这样高频的不规则动词,半衰期为 5,400 年,相当于在规则化之前的预期寿命约为 7,800 年。
When drove will regularize. The half-life of irregular verbs as frequent as drove is 5,400 years, which is equivalent to an expected lifetime of about 7,800 years before it regularizes.
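The conversion from half-life to expected lifetime follows the same rule as radioactive decay: under exponential decay, the mean lifetime is the half-life divided by the natural logarithm of 2. The sketch below applies that rule. It assumes regularization behaves like exponential decay, and the function name expected_lifetime is made up for the example.

```python
import math


def expected_lifetime(half_life_years):
    """Mean lifetime under exponential decay: half-life divided by ln 2."""
    return half_life_years / math.log(2)


lifetime = expected_lifetime(5400)
print(round(lifetime))
```

For a half-life of 5,400 years this gives a little under 7,800 years, consistent with the rounded figure above.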
约翰·哈佛的闪亮鞋子
John Harvard’s Shiny Shoe
游客会用手擦鞋。但让鞋子保持光亮的不仅仅是手。许多本科生也会往鞋上撒尿;2013 年,23% 的哈佛大学毕业生表示曾这样做过。将“厕所”放回约翰哈佛是哈佛本科生“三大”成年礼之一。第二大礼仪是裸体喊叫仪式,被称为原始尖叫。第三大礼仪是在怀德纳图书馆做爱,这表明学生群体对与书籍发生性关系的热情持续高涨。试着用 Kindle 做这件事。请参阅 Julie M. Zauzmer 的“我们的立场:2013 届毕业生调查”,《哈佛深红报》,2013 年 5 月 28 日,在线网址为 http://goo.gl/1EpfA。
Visitors rub the shoe with their hands. But it’s not just hands that keep the shoe shiny. Many undergrads also urinate on the shoe; in 2013, 23 percent of graduating Harvard seniors reported having done so. Putting the “john” back in John Harvard is one of the “big three” rites of passage for Harvard undergraduates. The second is a nude yelling ritual known as primal scream. The third is having sex in Widener Library, demonstrating the student body’s continuing enthusiasm for getting physical with books. Try doing that with a Kindle. See Julie M. Zauzmer, “Where We Stand: The Class of 2013 Senior Survey,” Harvard Crimson, May 28, 2013, online at http://goo.gl/1EpfA.
词典和协和语
Lexicon and Concord
创建索引。有些索引比其他索引更强大。必须指出的是,即使撇开更具挑战性的源材料,Busa 的索引也比 Reimer 的复杂得多。例如,《托米斯提克索引》对基础文本进行了完整的词形还原,将所有单词按词汇相关的类别分组。(在英语中,词形还原会将run、 running、 runs 、 ran、outrun和also-ran等词归入一个标题下。)这种词形还原本身就是一项了不起的成就。我们发布的 ngram 数据集不包含词形还原功能。做好词形还原非常困难。
Creating concordances. Some concordances are more powerful than others. It must be pointed out that, even if you set aside the much more challenging source material, Busa’s concordance was far more sophisticated than Reimer’s. For instance, the Index Thomisticus incorporates a complete lemmatization of the underlying text, grouping all words into lexically related classes. (In English, a lemmatization would group words like run, running, runs, ran, outrun, and also-ran under a single heading.) This lemmatization is itself a remarkable accomplishment. The ngram datasets we released do not feature lemmatization. It’s very hard to do well.
《托米斯提克索引》。 1980年,布萨发表了他与IBM数十年合作的亲身经历。这份文献极具先见之明,蕴含着不胜枚举的洞见。例如,他预见到了对大型人文学科的需求(另见第七章的讨论),他写道:
The Index Thomisticus. In 1980, Busa published a firsthand account of his decades-long collaboration with IBM. It is an astonishingly prescient document, packed with too many insights to enumerate. For instance, anticipating the need for big humanities (see also our discussion in chapter 7), Busa writes:
当今的学术生活似乎更倾向于需要快速发表的许多短期研究项目,而不是需要团队合作数十年的项目。...在一公里宽的基础上每次增加一厘米来积累成果,要比在一厘米的基础上进行一公里的研究要好得多。
Today’s academic life seems to be more in favor of many short-term research projects which need to be published quickly, rather than of projects requiring teams of co-workers collaborating for decades. . . . It would be much better to build up results one centimetre at a time on a base one kilometre wide, than to build up a kilometre of research on a one centimetre base.
三十多年后,时任美国历史协会主席的安东尼·格拉夫顿表达了类似的想法:
More than thirty years later, Anthony Grafton, then president of the American Historical Association, expressed a similar train of thought:
随着新形式的科学研究为历史学家提供了补充文本记录的研究可能性,随着数字档案和展览的扩展以及数字研究方法变得更加容易获得,历史学家必须学会如何组建团队并开展工作。...协作为传统学者提供了一种方法 - 可能是一种非常强大的方法 - 来创建以深厚的档案和文本基础为基础的全球经济、文化和政治关系历史。
As new forms of scientific research offer historians research possibilities that complement the textual record, as digital archives and exhibitions expand and digital research methods become more accessible, historians will have to learn how to form and work in teams. . . . Collaboration offers one way—potentially a very powerful one—for scholars of traditional bent to create global histories of economic, cultural, and political relations that rest on deep archival and textual foundations.
布萨的论述堪称数字人文运动的奠基性文献,至今仍是必读之作。参见R. Busa,《人文计算年鉴:托马斯提克斯索引》 ,载《计算机与人文》第14卷(1980年),第83-90页,在线访问:http://goo.gl/FgVWQ;A. Grafton,《孤独与自由》,载《历史视角》,2011年3月,在线访问:http://goo.gl/dOx3J。
Arguably the founding document of the digital humanities movement, Busa’s account remains required reading to this day. See R. Busa, “The Annals of Humanities Computing: The Index Thomisticus,” Computers and the Humanities 14 (1980): 83–90, online at http://goo.gl/FgVWQ; A. Grafton, “Loneliness and Freedom,” Perspectives on History, March 2011, online at http://goo.gl/dOx3J.
把玫瑰掰开数花瓣
Taking Roses Apart to Count Their Petals
“把玫瑰拆开。”参见GA Miller,《语言的心理生物学》(马萨诸塞州剑桥:麻省理工学院出版社,1965年)导言,在线访问:http://goo.gl/KYvOcK。他1965年导言开篇的完整引言,至今仍具有现实意义:
“Take roses apart.” See G. A. Miller, introduction to The Psycho-Biology of Language (Cambridge, MA: MIT Press, 1965), online at http://goo.gl/KYvOcK. The full quote, from the very beginning of his 1965 introduction, is as relevant today as it ever was:
《语言的心理生物学》并非旨在迎合所有人的口味。齐普夫是那种会把玫瑰掰开数花瓣的人;如果把莎士比亚十四行诗中的不同词语列表出来会违背你的价值观,那么这本书不适合你。齐普夫以科学家的视角看待语言——对他来说,这意味着对语言作为一种生物、心理和社会过程进行统计学分析。如果你厌恶这种分析,那就别管你的语言了,像躲避瘟疫一样躲开乔治·金斯利·齐普夫。读马克·吐温的“世上有骗子,该死的骗子,还有统计学家”或者WH·奥登的“你不应该和统计学家坐在一起,也不应该从事社会科学研究”会让你更快乐。
The Psycho-Biology of Language is not calculated to please every taste. Zipf was the kind of man who would take roses apart to count their petals; if it violates your sense of values to tabulate the different words in a Shakespearean sonnet, this is not a book for you. Zipf took a scientist’s view of language—and for him that meant the statistical analysis of language as a biological, psychological, social process. If such analysis repels you, then leave your language alone and avoid George Kingsley Zipf like the plague. You will be much happier reading Mark Twain: “There are liars, damned liars, and statisticians.” Or W. H. Auden: “Thou shalt not sit with statisticians nor commit a social science.”
然而,对于那些为了正义事业而目睹美丽被谋杀却毫不畏惧的人来说,齐普夫的科学努力产生了一些令人惊奇、出乎意料的结果,令人难以置信,也激发了人们的想象力。
However, for those who do not flinch to see beauty murdered in a good cause, Zipf’s scientific exertions yielded some wonderfully unexpected results to boggle the mind and tease the imagination.
烧焦了,宝贝,烧焦了
Burnt, baby, burnt
迈克尔·菲尔普斯。参见莎莉·詹金斯,《精疲力竭的菲尔普斯在与罗切特的比赛中惨败》,《华盛顿邮报》 ,2012年7月29日。
Michael Phelps. See Sally Jenkins, “Burned-Out Phelps Fizzles in the Water against Lochte,” Washington Post, July 29, 2012.
科比·布莱恩特。参见梅丽莎·罗林,《科比·布莱恩特称从菲尔·杰克逊那里学到了很多》,《洛杉矶时报》,2012年11月14日,在线访问:http://goo.gl/bKGDTg。
Kobe Bryant. See Melissa Rohlin, “Kobe Bryant Says He Learned a Lot from Phil Jackson,” Los Angeles Times, November 14, 2012, online at http://goo.gl/bKGDTg.
不规则动词联盟。请参阅史蒂芬·平克(Steven Pinker)著《词汇与规则:语言的成分》(纽约:基础图书出版社,1999年)中关于此主题的讨论;Lieberman 等著《量化语言的演化动态》及其补充材料;Michel,2011 和 Michel,2011S。
An alliance of irregular verbs. See the discussion of this topic in Steven Pinker, Words and Rules: The Ingredients of Language (New York: Basic Books, 1999); Lieberman et al., “Quantifying the Evolutionary Dynamics of Language,” and its supplemental materials; Michel2011 and Michel2011S.
英国剑桥。我们假设burned与burnt 的频率比反映了英国英语使用者使用每种形式的比例。
Cambridge, England. Here we assume that the burned-to-burnt frequency ratio reflects the proportion of English speakers in the United Kingdom who use each form.
第三章 空谈词典学家
CHAPTER 3. ARMCHAIR LEXICOGRAPHEROLOGISTS
简介
Intro
大脚怪。参见杰夫·梅尔德鲁姆(Jeff Meldrum)著《大脚怪:传说与科学的碰撞》(纽约:Forge出版社,2006年)。
Sasquatch. See Jeff Meldrum, Sasquatch: Legend Meets Science (New York: Forge, 2006).
卓柏卡布拉。 洛伦·科尔曼和杰罗姆·克拉克在《神秘动物学从 A 到 Z》(纽约:Fireside,1999 年)一书中讨论了这些生物以及许多其他生物。需要注意的是,卓柏卡布拉是成群结队的;如果你在一句话中碰巧遇到一只,很有可能附近还有其他潜伏着的。卓柏卡布拉的出现频率正在飙升,所以它们在未来可能会更加常见。
Chupacabra. These creatures, and many more, are discussed in Loren Coleman and Jerome Clark, Cryptozoology A to Z (New York: Fireside, 1999). Note that chupacabras travel in packs; if you happen to run into one in a sentence, there’s a decent chance that there are others lurking nearby. The frequency of chupacabra is surging right now, so they will probably be more common in the future.
Twenty-nine-year-old Billionaire Psychology
谷歌图书。请参阅“Google图书历史”,http://goo.gl/ueobb。
Google Books. See “Google Books History,” http://goo.gl/ueobb.
数字化项目时间估算。 对于密歇根大学来说,五百年只是乘法;科尔曼估计的一千年大概包括了翻书以外的时间,当然,可能没有考虑到只有一个人翻书。假设有1.3亿本书,每本书40分钟,那么全部翻完需要9900年。
Digitization project time estimates. Five hundred years for the University of Michigan is just multiplication; Coleman’s estimate of a thousand years presumably includes time for doing things other than flipping pages, and of course may not have assumed just one person doing the flipping. Assuming 130 million books and forty minutes per book, it would take 9,900 years to do them all.
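The 9,900-year figure can be checked with a few lines of arithmetic. The sketch below assumes a single page-turner working around the clock, every day of a 365-day year; those working conditions are our assumption for illustration, since the note gives only the book count and the minutes per book.

```python
# Back-of-the-envelope check of the digitization estimate.
BOOKS = 130_000_000                 # from the note
MINUTES_PER_BOOK = 40               # from the note
MINUTES_PER_YEAR = 60 * 24 * 365    # assumption: nonstop work, 365-day year


def years_to_digitize(books, minutes_per_book):
    """Years needed for one person to page through every book."""
    return books * minutes_per_book / MINUTES_PER_YEAR


years = years_to_digitize(BOOKS, MINUTES_PER_BOOK)
print(f"{years:,.0f} years")
```

Rounded to the nearest hundred, the result matches the figure quoted above. Any allowance for sleep or weekends would only push the number higher.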
页面的页面
Page’s Pages
词汇公告。似乎可以构造一个仅由单词page和pages组成的任意长度的英语句子。例如:
Lexical bulletin. It seems possible to construct an English sentence of arbitrary length consisting only of the words page and pages. For instance:
“翻页!”(玛丽莎·梅耶尔命令某人翻页。)
“Page!” (Marissa Mayer, commanding someone to turn pages.)
“页面,页面!”(玛丽莎命令拉里。)
“Page, page!” (Marissa, commanding Larry.)
“翻页,翻页!”(更详细的指令。)
“Page, page pages!” (A more detailed instruction.)
“翻看佩奇的页面!翻看佩奇的页面!”(拉里翻看别人的页面,把事情搞砸了。)
“Page, page Page’s pages!” (By paging someone else’s pages, Larry was screwing things up.)
“佩奇,佩奇,佩奇的页面的页面。”(佩奇的页面落后了。)
“Page, page Page’s page’s pages.” (Page’s page was falling behind.)
“页面,页面页面,页面的页面。”(玛丽莎告诉页面,通常分配给拉里的特定页面会去翻阅这些页面。)
“Page, page pages Page’s page pages.” (Marissa, telling a page to page the pages that the specific page assigned to Larry usually pages.)
1.3亿本书。参见Leonid Taycher的《世界图书,站起来!你们129,864,880本书》,谷歌图书搜索,2010年8月5日,http://goo.gl/5yNV。Taycher是谷歌首席元数据专家。
130 million books. See Leonid Taycher, “Books of the world, stand up and be counted! All 129,864,880 of you,” Google Books Search, August 5, 2010, http://goo.gl/5yNV. Taycher is Google’s chief metadata guru.
无损扫描。任何尝试过复印书页的人都知道,获得高质量的图像并非易事。这只是众多需要克服的问题之一:书页通常不会平放;靠近书页时,它们会向内弯曲。为了解决这个问题,谷歌开发了一套系统,可以根据书页的曲率来校正每幅图像。Michel2011S 对此过程进行了更深入的讨论。
Nondestructive scanning. As anyone who’s ever tried to Xerox a book page would know, getting good images can be tricky. Here’s just one of the many problems that needed to be overcome: Pages in books don’t like to lie flat; they curve inward as one gets close to the binding. To solve this problem, Google developed a system for correcting each image to account for the curvature of the page. A much more extensive discussion of this process appears in Michel2011S.
盖洛普。盖洛普的七天平均值基于对约2700名潜在选民的调查。参见盖洛普“2012年大选潜在选民试探:奥巴马 vs. 罗姆尼”,http://goo.gl/ujbzb。
Gallup. Gallup’s seven-day averages were based on surveys of approximately 2,700 likely voters. See “Election 2012 Likely Voters Trial Heat: Obama vs. Romney,” Gallup, http://goo.gl/ujbzb.
25岁的心理学研究生
Twenty-five-year-old Graduate Student Psychology
彼得·诺维格(Peter Norvig)。他的慕课课程,请参阅“人工智能导论”,https://www.ai-class.com/。他的教材,请参阅斯图尔特·J·拉塞尔(Stuart J. Russell)和彼得·诺维格(Peter Norvig)合著的《人工智能:一种现代方法》(新泽西州恩格尔伍德克利夫斯:普伦蒂斯霍尔出版社,1995年)。
Peter Norvig. For his MOOC, see “Introduction to Artificial Intelligence,” https://www.ai-class.com/. For his textbook, see Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach (Englewood Cliffs, NJ: Prentice Hall, 1995).
财富 500 强企业法律部门心理学
Fortune 500 Legal Department Psychology
法律问题。维基百科一直在密切关注这些诉讼及其复杂且持续的进展。请参阅“Google 图书搜索和解案”,维基百科,2013 年 6 月 23 日,http://goo.gl/8E5Cx。Giovanna Occhipinti Trigona 的“Google 图书搜索选择”,《知识产权法律与实践杂志》第 6 卷,第 4 期(2011 年 3 月 10 日):第 262-273 页,以及 Marshall A. Leaffer 的《理解版权法》第 5 版(纽约州奥尔巴尼:Matthew Bender 出版社,2011 年)中更全面的讨论。Charles W. Bailey, Jr. 的“Google 图书参考书目”,《数字学术研究》 ,2011 年,http://goo.gl/grff2,其中包含关于此主题的非常详细的参考书目。请参阅 Thomas C. Rubin 在《寻找原则:在线服务和知识产权》一文中的评论,微软,http://goo.gl/GX3CB。
Legal issues. Wikipedia has been keeping close track of the lawsuits and their complex, ongoing development. See “Google Book Search Settlement,” Wikipedia, June 23, 2013, http://goo.gl/8E5Cx. Some of the legal issues are discussed in Giovanna Occhipinti Trigona, “Google Book Search Choices,” Journal of Intellectual Property Law and Practice 6, no. 4 (March 10, 2011): 262–73, and more generally in Marshall A. Leaffer, Understanding Copyright Law, 5th ed. (Albany, NY: Matthew Bender, 2011). A very detailed bibliography on this topic is kept at Charles W. Bailey, Jr., “Google Books Bibliography,” Digital Scholarship, 2011, http://goo.gl/grff2. See Rubin’s remarks at Thomas C. Rubin, “Searching for Principles: Online Services and Intellectual Property,” Microsoft, http://goo.gl/GX3CB.
大数据带来巨大阴影
Big Data Casts Big Shadows
美国在线。参见Michael Barbaro和Tom Zeller, Jr.合著的《AOL搜索器4417749号曝光》,《纽约时报》,2006年8月9日,http://goo.gl/c8MCY;以及“关于AOL搜索数据丑闻”,http://goo.gl/6hnfuI。
America Online. See Michael Barbaro and Tom Zeller, Jr., “A Face Is Exposed for AOL Searcher No. 4417749,” New York Times, August 9, 2006, http://goo.gl/c8MCY; “About AOL Search Data Scandal,” http://goo.gl/6hnfuI.
现代基因组测序的基础。由于其与基因组测序的相关性,目前已存在一套完善的理论体系,用于分析如何从微小的文本块中高效地组装出完整的文本。该领域文献的分水岭是 Lander-Waterman 统计量的开发。由于基因组测序技术的显著进步,以及哺乳动物基因组复杂的重复结构,这些统计数据实际上至少同样适用于基于 ngram 的全文语料库攻击,就像它们适用于当代基因组测序仪的输出结果一样。参见 ES Lander 和 MS Waterman,“通过指纹识别随机克隆进行基因组图谱绘制”,《基因组学》 2,第 3 期(1988 年 4 月):231-39,在线访问 http://goo.gl/wuAcXr。
The basis of modern genome sequencing. Because of its relevance to genome sequencing, an extensive theoretical apparatus already exists for analyzing the problem of how well you can assemble whole texts from tiny text-tiles. The watershed moment in this literature was the development of the Lander-Waterman statistics. Because of dramatic improvements in genome sequencing technology, and due to the complex repeat structure of mammalian genomes, these statistics actually apply at least as readily to ngram-based attacks on whole text corpora as they do to the output of contemporary genome sequencers. See E. S. Lander and M. S. Waterman, “Genomic Mapping by Fingerprinting Random Clones,” Genomics 2, no. 3 (April 1988): 231–39, online at http://goo.gl/wuAcXr.
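For readers who want a feel for these statistics, here is a minimal sketch of two standard Lander-Waterman quantities: the expected fraction of a target left uncovered and the expected number of disconnected "islands." Mapping reads to ngrams and a genome to a text is our illustrative assumption, as are the function and parameter names; real ngram tables are not random samples, so this is a rough analogy rather than the analysis used in the book.

```python
import math


def lander_waterman(target_len, read_len, n_reads, min_overlap):
    """Return (coverage, expected uncovered fraction, expected islands).

    target_len  -- length of the genome (or text), in bases (or words)
    read_len    -- length of each random fragment (e.g. 5 for a 5-gram)
    n_reads     -- number of fragments sampled at random positions
    min_overlap -- overlap needed to join two fragments
    """
    coverage = n_reads * read_len / target_len
    sigma = 1 - min_overlap / read_len
    frac_uncovered = math.exp(-coverage)
    expected_islands = n_reads * math.exp(-coverage * sigma)
    return coverage, frac_uncovered, expected_islands


# Hypothetical numbers, chosen only to exercise the formulas.
cov, gap, islands = lander_waterman(1_000, 10, 100, 0)
print(cov, gap, islands)
```

As coverage rises, the uncovered fraction falls off exponentially, which is why a sufficiently dense set of short overlapping fragments can, in principle, be stitched back into long stretches of the original.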
自由世界的领袖
Leaders of the Free Word
“土豆”。参见丹·奎尔的《坚定不移》(纽约:哈珀柯林斯出版社,1994年);马克·法斯的《你如何拼写遗憾?一个人的看法》,《纽约时报》,2004年8月29日,在线网址为http://goo.gl/gWW4wK。
“Potatoe.” See Dan Quayle, Standing Firm (New York: HarperCollins, 1994); Mark Fass, “How Do You Spell Regret? One Man’s Take on It,” New York Times, August 29, 2004, online at http://goo.gl/gWW4wK.
Refudiated。佩林在2010年7月18日的一条推文中使用了“1-gram”一词,这在当时非常著名。此前,她曾在电视上用过这个词。参见Max Read,《莎拉·佩林发明新词:‘Refudiate’》, Gawker,2010年7月19日,在线网址:http://goo.gl/XjV7TJ。
Refudiated. Palin famously used the 1-gram in a tweet on July 18, 2010. She had previously used the word on television. See Max Read, “Sarah Palin Invents New Word: ‘Refudiate,’” Gawker, July 19, 2010, online at http://goo.gl/XjV7TJ.
“莎士比亚也喜欢创造新词。”参见迈克尔·马克龙,《刷读你的莎士比亚》(纽约:哈珀柯林斯出版社,1990年);杰弗里·麦奎因和斯坦利·马利斯,《莎士比亚创造的新词》 (马萨诸塞州斯普林菲尔德:韦氏出版社,1998年)。
“Shakespeare liked to coin new words too.” See Michael Macrone, Brush Up Your Shakespeare (New York: HarperCollins, 1990); Jeffrey McQuain and Stanley Malless, Coined by Shakespeare (Springfield, MA: Merriam-Webster, 1998).
使用 Word,还是不使用 Word?
To Word, or Not to Word?
美国传统词典。尽管AHD在语言学上以保守著称,但从方法论的角度来看,它一直极具创新性。
American Heritage Dictionary. Despite its linguistically conservative reputation, AHD has long been, from the methodological standpoint, extremely innovative.
1967年,亨利·库切拉(Henry Kucera)和W·纳尔逊·弗朗西斯(W. Nelson Francis)出版了“布朗语料库”(Brown Corpus),这是一个包含数百万字的文本集合,旨在涵盖广泛的语料类型。这份出版物对语料库语言学作为一门学科的兴起起到了重要作用,因此,从许多方面来看,它都是我们在谷歌创建的语料库的最早、最重要的先驱。
In 1967, Henry Kucera and W. Nelson Francis published the “Brown Corpus,” a million-word text collection meant to be representative of a broad array of genres. This publication proved instrumental in the emergence of corpus linguistics as an academic discipline, and is therefore, in many ways, the earliest and most important forerunner of the corpus we created at Google.
不久之后,出版商霍顿·米夫林与库切拉接洽,希望创建一个语料库,以协助该公司正在编写的新词典。本质上,该出版商打算将埃尔德里奇的策略(参见第二章“1937:数据之旅”的注释)付诸实践,利用词汇统计数据构建英语词汇。霍顿·米夫林出版于1969年的《美国传统词典》第一版是第一本采用此策略的词典。
Shortly thereafter, publisher Houghton Mifflin approached Kucera about creating a corpus to assist with the new dictionary that the company was creating. Essentially, the publisher intended to put the strategy of Eldridge (see the notes to chapter 2’s “1937: A Data Odyssey”) into practice, using lexical statistics to construct a vocabulary of the English language. The first edition of Houghton Mifflin’s American Heritage Dictionary, which appeared in 1969, was the first dictionary to employ such a strategy.
因此,我们很自然地想知道,在我们强大的、基于 Google 图书的新语料库的映衬下,开拓性的AHD 的表现会如何。幸运的是,曾于 1997 年至 2011 年担任AHD执行编辑的 Joseph P. Pickett 很乐意参与其中。因此,我们对《美国传统词典》的所有分析都极大地受益于他的积极合作以及他手下员工的协助。本书中关于AHD的所有数字都基于与 Pickett 及其团队的沟通以及他们提供的数据。(Pickett 最终是 Michel2011 的合著者。)虽然我们确实在文中不时批评AHD ,但很明显AHD认为积极追求新型分析有助于编纂出最好的词典。我们认为语言管理的透明度是一个好主意,而且没有其他参考书能像AHD一样透明。
It was therefore natural to wonder how well the trailblazing AHD might hold up in light of our powerful new Google Books–based corpus. Luckily, Joseph P. Pickett, who was executive editor of the AHD from 1997 to 2011, was happy to participate. Thus, all of our analyses of the American Heritage Dictionary benefited immensely from his active collaboration, as well as from the assistance of his staff. All of the numbers reported about the AHD in this book are based on communication with Pickett and his team, as well as data that they provided. (Pickett was ultimately a coauthor of Michel2011.) Although we do critique the AHD at times in the text, it was clear the AHD felt that aggressively pursuing new types of analysis could help make the best possible dictionary. We think transparency in linguistic governance is a great idea, and no other reference work proved as transparent as the AHD.
AHD以依赖一个使用小组而闻名。该小组由来自各行各业的约两百名语言专家组成,从最高法院大法官安东尼·斯卡利亚到《纽约时报》填字游戏编辑威尔·肖茨,再到普利策奖得主朱诺特·迪亚兹。小组主席由史蒂芬·平克(也是 Michel2011 的合著者)担任。该小组在许多方面与文化组学或文本语料库统计的追踪方法截然相反。语言。它并不依赖于普遍的语言使用代表性样本,而是依赖于少数语言专家——词汇精英。
The AHD famously relies on a usage panel. This panel consists of about two hundred language experts from all walks of life, ranging from Supreme Court justice Antonin Scalia to New York Times crossword editor Will Shortz to Pulitzer Prize–winning author Junot Díaz. It is chaired by Steven Pinker (also a coauthor of Michel2011). The panel represents, in many ways, the opposite of the culturomics or text-corpus-statistics approach to tracking language. It doesn’t rely on representative sampling of language use in general, but instead on a small number of language experts—a lexical elite.
我们想知道这两种方法的比较结果如何。每年,AHD都会向其使用小组发送一份问卷。有一年,AHD允许我们创建自己的问卷补充材料,小组成员也填写了该补充材料。我们将结果与我们的 ngram 发现进行了比较。例如,我们询问了他们关于sneaked和snuck 的问题:小组成员认为哪种过去时形式是可以接受的?我们发现,较年轻的小组成员更有可能认为snuck可以接受(未发表的数据)。我们的 ngram 发现显示, snuck在过去几十年中迅速传播。总而言之,这些结果可能表明,小组成员,或许更普遍地说,语言使用者,倾向于在年轻时形成对可接受或不可接受用法的观念。
We wondered how these two approaches would compare. Each year, the AHD sends out a questionnaire to its usage panel. One year, the AHD allowed us to create our own supplement to this questionnaire, which the panelists also filled out. We compared the results to our ngram findings. For example, we asked them about sneaked and snuck: Which of these past-tense forms did the panelists find acceptable? We found that younger panelists were more likely to find snuck acceptable (unpublished data). Our ngram findings show the rapid spread of snuck in the last few decades. Taken together, these results may suggest that panelists, and perhaps language users more generally, tend to form their notions of what is or is not acceptable usage at a young age.
参见《美国传统英语词典》,第 4 版(波士顿:霍顿·米夫林,2000 年);“用法小组”,《美国传统词典》, 2013 年,http://goo.gl/JtT4l;弗朗西斯·纳尔逊和亨利·库塞拉,《布朗语料库手册》(布朗大学语言学系,1979 年)。
See American Heritage Dictionary of the English Language, 4th ed. (Boston: Houghton Mifflin, 2000); “The Usage Panel,” American Heritage Dictionary, 2013, http://goo.gl/JtT4l; W. Nelson Francis and Henry Kucera, Brown Corpus Manual (Brown University Department of Linguistics, 1979).
AHD中的词汇数量。AHD团队向我们提供了其词典第四版中所有条目的 153,459 个词条列表。有时,同一个词会在列表中出现多次;例如, “console”既是名词又是动词。我们删除了这些重复出现的词条。我们还删除了非单个词条的词条,例如“ men's room”。最终的词条列表包含 116,156 个词条。
Number of words in the AHD. The AHD team provided us with a list of the 153,459 headwords of all entries in the fourth edition of their dictionary. Sometimes, the same word appeared multiple times on the list; for instance, console appeared as both a noun and a verb. We removed such multiples. We also removed headwords that were not single words, like men’s room. The resulting word list contained 116,156 words.
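The two filtering steps described above can be sketched as follows. The function name and the toy list are hypothetical, and treating any entry that contains whitespace as "not a single word" is our simplifying assumption; the actual AHD list and the exact rules applied to it are not reproduced here.

```python
def single_word_headwords(headwords):
    """Collapse repeated headwords and drop multiword entries."""
    unique = set(headwords)  # e.g. console (noun) and console (verb) become one entry
    # Assumption: an entry with internal whitespace is a multiword headword.
    return sorted(w for w in unique if len(w.split()) == 1)


# Toy input, for illustration only.
toy_list = ["console", "console", "men's room", "page"]
print(single_word_headwords(toy_list))
```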
OED中的单词数量。 此数字是OED最后一个印刷版本,即 1989 年第二版的数字。(包括牛津大学出版社首席执行官 Nigel Portwood 在内的许多人都怀疑第三版永远不会出现印刷版,因为此类引用通常会迁移到网络上。)遗憾的是,我们并未受益于OED。OED网站报告称“定义和/或说明的词形数量”为 615,100。根据序言,此版本还包含 169,000 个不是 1-gram 的“斜体粗体短语和组合”。我们的估计 446,000 只是这两个值之间的差值。这不是一个精确的估计,而是一个上限 - OED第二版的 1-gram 单词不会比这个值多,但可能会少。《牛津英语词典》最近邀请我们作为代表参加一个关于其未来的研讨会,因此,或许可以开展一次更为稳健的AHD式合作。如果能得到确切的数字就更好了。请参阅《牛津英语词典》第二版(牛津:牛津大学出版社,1989 年);“词典事实”,《牛津英语词典》,http://goo.gl/DL6a7;Bas Aarts 和 April McMahon,《英语语言学手册》(新泽西州霍博肯:John Wiley & Sons,2008 年);Alastair Jamieson,《牛津英语词典‘将不再印刷’》 ,《每日电讯报》,2010 年 8 月 29 日,在线网址为 http://goo.gl/V5g8Ak。
Number of words in the OED. This number is for the OED’s last printed edition, the second edition of 1989. (Many, including the CEO of Oxford University Press, Nigel Portwood, suspect the third edition will never appear in print, because of the general migration of such references to the Web.) Alas, we did not have the benefit of OED assistance. The OED Web site reports that the “number of word forms defined and/or illustrated” is 615,100. According to the preface, this edition also contained 169,000 “italicized-bold phrases and combinations,” which are not 1-grams. Our estimate, 446,000, is just the difference between these two values. It is not an exact estimate, but rather an upper bound—the OED’s second edition does not have more 1-gram words than this value, but may have fewer. The OED recently invited us to be delegates to a symposium on its future, so perhaps a more robust, AHD-style collaboration is in the cards. It sure would be nice to get exact numbers. See Oxford English Dictionary, 2nd ed. (Oxford: Oxford University Press, 1989); “Dictionary Facts,” Oxford English Dictionary, http://goo.gl/DL6a7; Bas Aarts and April McMahon, The Handbook of English Linguistics (Hoboken, NJ: John Wiley & Sons, 2008); Alastair Jamieson, “Oxford English Dictionary ‘Will Not Be Printed Again,’” Telegraph, August 29, 2010, online at http://goo.gl/V5g8Ak.
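The subtraction behind this upper bound is simple enough to check directly. In the sketch below the two inputs are the figures quoted in the note; rounding the difference to the nearest thousand is our assumption about how the quoted 446,000 was reached.

```python
WORD_FORMS = 615_100               # "word forms defined and/or illustrated"
PHRASES_AND_COMBINATIONS = 169_000  # entries that are not 1-grams

# Upper bound on the number of 1-gram words in the second edition.
upper_bound = WORD_FORMS - PHRASES_AND_COMBINATIONS
print(upper_bound, round(upper_bound, -3))
```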
规范性与描述性。参见琼·阿科塞拉(Joan Acocella)的《英语之争》(The English Wars),《纽约客》( New Yorker ) ,2012年5月14日,在线访问:http://goo.gl/wGVHsx;瑞安·布鲁姆(Ryan Bloom)的《不可避免地,你被你的语言评判》(Inescapably, You're Judged by Your Language),《纽约客》(New Yorker),2012年5月29日,在线访问:http://goo.gl/js9VJc;史蒂芬·平克(Steven Pinker)的《语言之争中的虚假阵线》(False Fronts in the Language Wars),《 Slate》(Slate),2012年5月31日,在线访问:http://goo.gl/33vNYT。学术界也同样存在激烈的争论。例如,参见亨宁·伯根霍尔茨(Henning Bergenholtz)和鲁弗斯·H·古斯(Rufus H. Gouws)的《从功能性角度探讨描述性、规范性和规范性词典编纂学之间的选择》(A Functional Approach to the Choice Between Descriptive, Prescriptive and Proscriptive Lexicography),《 Lexicos》第20卷(2010),在线访问:http://goo.gl/agXm7S。
Prescriptive vs. descriptive. See the ferocious public debates at Joan Acocella, “The English Wars,” New Yorker, May 14, 2012, online at http://goo.gl/wGVHsx; Ryan Bloom, “Inescapably, You’re Judged by Your Language,” New Yorker, May 29, 2012, online at http://goo.gl/js9VJc; Steven Pinker, “False Fronts in the Language Wars,” Slate, May 31, 2012, online at http://goo.gl/33vNYT. The debate also rages in academic circles. See, for instance, Henning Bergenholtz and Rufus H. Gouws, “A Functional Approach to the Choice Between Descriptive, Prescriptive and Proscriptive Lexicography,” Lexicos 20 (2010), online at http://goo.gl/agXm7S.
“公麋”词典编纂。罗斯福当时支持一项由简化拼写委员会最初提出的计划。参见戴维·沃尔曼,《纠正母语:从古英语到电子邮件,英语拼写的错综复杂故事》(纽约:哈珀·佩伦尼利亚出版社,2010年)。罗斯福就此话题所写的一封原始信件,可在迪金森州立大学西奥多·罗斯福中心的《西奥多·罗斯福致威廉·迪安·豪厄尔斯的信》电子传真版中找到,网址:http://goo.gl/JA8cP。
“Bull Moose” lexicography. Roosevelt was supporting a plan first proposed by a group known as the Simplified Spelling Board. See David Wolman, Righting the Mother Tongue: From Olde English to Email, the Tangled Story of English Spelling (New York: Harper Perennial, 2010). An original letter of Roosevelt’s on the topic can be seen, in digital facsimile, at “Letter from Theodore Roosevelt to William Dean Howells,” Theodore Roosevelt Center at Dickinson State University, http://goo.gl/JA8cP.
#大笑!笑得在地上打滚。如果你不知道,别担心:大多数词典也不知道。
#ROFL. Rolling on the floor laughing. If you don’t know this, don’t worry: Most dictionaries don’t, either.
分析。本章其余部分提出的所有分析均在 Michel2011 和 Michel2011S 中详细说明。
Analysis. All the analyses presented in the balance of the chapter are detailed in Michel2011 and Michel2011S.
DIY词典
DIY Dictionary
截止频率。我们计算了《美国传统词典》中 116,156 个独特的 1-gram 词条的频率分布。在十分之一的百分位之后,大约在十亿分之一的范围内,频率开始飙升。
Cutoff frequency. We calculated the frequency distribution of the 116,156 unique 1-gram headwords in the American Heritage Dictionary. After the tenth percentile, at roughly one part per billion, the frequencies begin to soar.
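A cutoff of this kind can be read off a sorted list of headword frequencies. The sketch below uses the nearest-rank definition of a percentile, which is our choice for illustration; the note does not say which definition was used, and the toy frequencies are invented.

```python
def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need data and 0 < p <= 100")
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceiling of p*n/100
    return ordered[rank - 1]


# Hypothetical frequencies, in parts per billion.
toy_frequencies = [0.2, 0.9, 1.1, 3.0, 8.0, 20.0, 55.0, 140.0, 900.0, 5000.0]
cutoff = percentile(toy_frequencies, 10)
print(cutoff)
```

With real data, the cutoff would be whichever frequency marks the point where the sorted curve starts to climb steeply, which the note places near the tenth percentile.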
包含非字母字符的单词。 一个单词是否必须完全由字母字符组成尚不明确。例如,《牛津英语词典》最近首次添加了♥这个符号。参见Erica Ho的《牛津英语词典将“ ♥ ”和“LOL”添加为单词》,《时代》杂志,2011年3月25日,在线网址:http://goo.gl/0RB6EA。
Words with nonalphabetical characters. It’s not at all clear that a word has to be composed entirely of alphabetic characters. For instance, the OED recently added, for the first time, an entry for a symbol, ♥. See Erica Ho, “The Oxford-English Dictionary Adds ‘♥’ and ‘LOL’ as Words,” Time, March 25, 2011, online at http://goo.gl/0RB6EA.
创建 Zipfian 词典。需要注意的是,这个 Zipfian 词典只是 Eldridge 所倡导并体现在AHD中的理念的当代更新,该理念认为词汇统计数据可以用来编纂更好的词典。Richard W. Bailey 的《研究词典》一文中对此进行了有力的论证,该文发表于1969 年的《美国演讲》第 44 卷第 3 期,第 166-172 页,在线访问:http://goo.gl/4RqfDu。
Creating a Zipfian lexicon. Note that this Zipfian lexicon is just a contemporary update of the idea, espoused by Eldridge and embodied in the AHD, that lexical statistics could be used to compile better dictionaries. An early, forceful argument to this effect appears in Richard W. Bailey, “Research Dictionaries,” American Speech 44, no. 3 (1969): 166–72, online at http://goo.gl/4RqfDu.
词汇暗物质
Lexical Dark Matter
排除类别。我们选择排除的类别(非字母词、易于从其组成词理解的复合词、变体拼写以及难以定义的词)是基于与《美国传统词典》的约瑟夫·皮克特(Joseph Pickett)的讨论。标准各不相同,但总体而言,词典有意排除某些词的时间与它们有意收录某些词的时间一样长。塞缪尔·约翰逊在其具有里程碑意义的1755年词典中讨论了许多排除词的例子。约翰逊博士在词典序言中对这一主题进行了丰富多彩的思考,虽然他没有讨论非字母词的情况,但确实探讨了其他三类词的挑战。
Excluded categories. Our choice of categories to be excluded (nonalphabetic terms, compounds that are easily understood from their component words, variant spellings, and undefinable terms) was based on discussions with Joseph Pickett of the American Heritage Dictionary. Standards vary somewhat from one dictionary to another, but broadly speaking, dictionaries have been deliberately excluding words for as long as they’ve been deliberately including them. Samuel Johnson discusses many examples of excluded words in his landmark 1755 dictionary. Dr. Johnson’s ever-colorful ruminations on this topic in the dictionary’s preface don’t discuss the case of nonalphabetic terms, but do address the challenge of the other three classes.
他几乎忽略了复合词:“我很少注意到复合词或双词,除非它们的含义与构成词本身的含义不同。例如,拦路强盗(highwayman)、樵夫(woodman)和赛马者(horsecourser)需要解释;但盗贼(thieflike)或马车夫(coachdriver)则无需赘述,因为原词本身就包含了复合词的含义。”
Compounds, which he mostly left out: “Compounded or double words I have seldom noted, except when they obtain a signification different from that which the components have in their simple state. Thus highwayman, woodman, and horsecourser, require an explication; but of thieflike or coachdriver no notice was needed, because the primitives contain the meaning of the compounds.”
他保留了大部分变体拼写:“我并没有刻意拒绝任何拼写,仅仅因为它们不必要或过于繁琐;但我接受了不同作者的不同拼写形式,例如 viscid、viscidity、viscous 和 visibilité。”当时的拼写远没有那么标准化。
Variant spellings, which he mostly left in: “I have not rejected any by design, merely because they were unnecessary or exuberant; but have received those which by different writers have been differently formed, as viscid, and viscidity, viscous, and viscosity.” Spelling was much less standardized at the time.
难以定义的术语,见:“还有一些词,它们的含义太过微妙和短暂,无法用释义来固定;这些词都被语法学家称为咒骂词,在死语言中,它们被认为是空洞的声音,除了填充诗句或调节句号之外没有其他用处,但在活的语言中,它们很容易被察觉到具有力量和强调,尽管有时这是其他表达形式无法传达的。”
Hard-to-define terms, in: “Other words there are, of which the sense is too subtle and evanescent to be fixed in a paraphrase; such are all those which are by the grammarians termed expletives, and, in dead languages, are suffered to pass for empty sounds, of no other use than to fill a verse, or to modulate a period, but which are easily perceived in living tongues to have power and emphasis, though it be sometimes such as no other form of expression can convey.”
他还排除了许多其他类别,其中许多类别至今仍是常见的排除目标。
He excludes many other categories as well, many of which remain common exclusion targets today.
名称:“由于我的目的是编纂一本词典,无论是普通的还是称谓的,我都省略了所有与专有名词有关的词汇,例如阿里乌派、索西尼派、加尔文派、本笃会、伊斯兰教;但保留了更通用的词汇,例如异教徒、异教徒。”行话:“必须坦率地承认,许多关于工艺和制造的术语被省略了;但对于这个缺陷,我可以大胆地说,这是不可避免的:我无法去洞穴学习矿工的语言,无法乘船出海完善我的航海方言技能,也无法参观商人的仓库和工匠的店铺,去获取书籍中没有提到的商品、工具和操作的名称;是什么样的有利的意外或简单的调查带来了在我力所能及的范围内,并没有被忽视;但收集词汇却是一项徒劳无功的工作,我得去寻找鲜活的信息,还要与一个人的阴郁和另一个人的粗鲁较劲。”在我们的分析中,韦氏在线词典在医学术语方面的表现往往优于牛津英语词典,因为后者包含一个独立的、庞大的医学术语词典(未发表的数据)。外来词:“我们的作者由于他们对外语的了解,或是由于对自己语言的无知,或是由于虚荣或放纵,或是由于追随时尚,或是由于渴望创新而引入的词汇,我都按其出现的方式记录下来,但通常只是为了谴责他们,并警告其他人不要愚蠢地让无用的外国人归化,以免损害当地人的利益。”时尚:“并非所有词汇中没有的词都应该被哀叹为遗漏。对于勤劳和经商的人来说,他们的用词在很大程度上是随意和多变的;他们的许多术语是为了某种暂时或局部的便利而形成的,虽然在某些时间和地点很流行,但在其他地方却完全不为人知。这种转瞬即逝的黑话,总是处于增长或衰退的状态,不能被视为语言持久材料的任何一部分,因此必须与其他不值得保存的东西一起消亡。英语中存在各种各样的暗物质。
Names: “As my design was a dictionary, common or appellative, I have omitted all words which have relation to proper names; such as Arian, Socinian, Calvinist, Benedictine, Mahometan; but have retained those of a more general nature, as Heathen, Pagan.”
Jargon: “That many terms of art and manufacture are omitted, must be frankly acknowledged; but for this defect I may boldly allege that it was unavoidable: I could not visit caverns to learn the miner’s language, nor take a voyage to perfect my skill in the dialect of navigation, nor visit the warehouses of merchants, and shops of artificers, to gain the names of wares, tools and operations, of which no mention is found in books; what favourable accident, or easy enquiry brought within my reach, has not been neglected; but it had been a hopeless labour to glean up words, by courting living information, and contesting with the sullenness of one, and the roughness of another.” In our analyses, Merriam-Webster’s online dictionary often outperforms the OED on medical jargon, because the former includes a separate, vast dictionary of medical terms (unpublished data).
Foreign words: “The words which our authours have introduced by their knowledge of foreign languages, or ignorance of their own, by vanity or wantonness, by compliance with fashion, or lust of innovation, I have registred as they occurred, though commonly only to censure them, and warn others against the folly of naturalizing useless foreigners to the injury of the natives.”
Fads: “Nor are all words which are not found in the vocabulary, to be lamented as omissions. Of the laborious and mercantile part of the people, the diction is in a great measure casual and mutable; many of their terms are formed for some temporary or local convenience, and though current at certain times and places, are in others utterly unknown. This fugitive cant, which is always in a state of increase or decay, cannot be regarded as any part of the durable materials of a language, and therefore must be suffered to perish with other things unworthy of preservation.”
There’s all sorts of dark matter in the English language.
参见塞缪尔·约翰逊,《英语词典》(伦敦,1755年);《韦氏大学词典》,第11版(马萨诸塞州斯普林菲尔德:韦氏出版社,2003年)。我们还推荐佩德罗·卡罗利诺,《英语原样》(纽约:阿普尔顿出版社,1883年)。
See Samuel Johnson, A Dictionary of the English Language (London, 1755); Merriam-Webster’s Collegiate Dictionary, 11th ed. (Springfield, MA: Merriam-Webster, 2003). We also recommend Pedro Carolino, English As She Is Spoke (New York: Appleton, 1883).
暗物质估算。我们从词典中抽取了一千个单词作为样本,并确定了有多少属于被排除的类别。因此,我们并没有一份包含所有英语暗物质的清单。就像宇宙中的暗物质一样,我们并不清楚它究竟是什么——只是知道它有很多。
Dark matter estimate. We took a sample of a thousand words from a lexicon and determined how many fell into excluded categories. As a consequence, we don’t have a list of all the English dark matter. Like the dark matter in the universe, we don’t know exactly what it is—just that there’s a lot of it.
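The sampling approach can be sketched in a few lines. Here `is_excluded` is a hypothetical stand-in for the human judgment about whether a word falls into an excluded category, and the toy lexicon is invented; only the sample size of a thousand comes from the note.

```python
import random


def estimate_excluded_fraction(lexicon, is_excluded, sample_size=1000, seed=0):
    """Estimate the share of a lexicon that falls into excluded categories."""
    rng = random.Random(seed)                      # fixed seed for repeatability
    k = min(sample_size, len(lexicon))
    sample = rng.sample(lexicon, k)                # sample without replacement
    hits = sum(1 for word in sample if is_excluded(word))
    return hits / k


# Toy example: treat any non-alphabetic term as "excluded."
toy_lexicon = ["page", "3d", "snuck", "b2b"]
fraction = estimate_excluded_fraction(toy_lexicon, lambda w: not w.isalpha())
print(fraction)
```

Multiplying such a fraction by the size of the full lexicon gives an estimate of how many words are dark matter, without ever producing a list of them.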
四个生日和一个葬礼
Four Birthdays and a Funeral
年度词汇。请参阅美国方言学会发布的《1990年至今年度词汇汇总》,http://goo.gl/JCYMiK。
Word of the Year. See “All of the Words of the Year, 1990 to Present,” American Dialect Society, http://goo.gl/JCYMiK.
成功率最低。我们非常激动,能够击败“空中飞人”(乘皮划艇从飞机上跳下)获得这一殊荣。然而,考虑到空中飞人爱好者经常面临的致命危险,或许有强有力的进化论据可以证明,空中飞人确实不太可能成功。当然,ADS 的预测不应被轻信;到 2011 年, “文化组学”一词已被兰登书屋和麦克米伦词典收录。参见“文化组学”,麦克米伦在线词典,http://goo.gl/qkg8GE;“文化组学”,Dictionary.com,http://goo.gl/EmvAhE。
Least Likely to Succeed. We were thrilled to have beaten skyaking—jumping off a plane in a kayak—for this honor. It does seem to us, though, that given the mortal peril routinely faced by skyaking devotees, there might be a strong evolutionary argument that skyaking is indeed less likely to succeed. Of course, the ADS predictions should not be taken at face value; by 2011, culturomics had entered both the Random House and Macmillan dictionaries. See “Culturomics,” Macmillan Dictionary online, http://goo.gl/qkg8GE; “Culturomics,” Dictionary.com, http://goo.gl/EmvAhE.
图表。中间时间点的估计值基于线性插值。
Chart. Estimates for intermediate time points were based on linear interpolation.
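Linear interpolation between two known time points works as in the sketch below. The function name and the numbers are hypothetical and are not the values plotted in the chart.

```python
def interpolate(year, known):
    """Linearly interpolate a value for `year` from a dict of {year: value}."""
    if year in known:
        return known[year]
    years = sorted(known)
    for y0, y1 in zip(years, years[1:]):
        if y0 < year < y1:
            t = (year - y0) / (y1 - y0)            # fraction of the way from y0 to y1
            return known[y0] + t * (known[y1] - known[y0])
    raise ValueError("year lies outside the known range")


# Invented lexicon sizes at two time points.
toy_points = {1900: 100.0, 2000: 200.0}
print(interpolate(1950, toy_points))
```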
语言发展和演变的原因。推测语言演变的确切原因,尤其是英语的未来,是一件很有趣的事情。参见迈克尔·埃拉德,《未来英语》(English As She Will Be Spoke),《新科学家》(New Scientist ) ,2008年3月29日;《英语即将到来》(English Is Coming),《经济学人》(Economist ),2009年2月12日,在线访问:http://goo.gl/wcPGt8。人们对这类事情的兴趣由来已久。参见约瑟夫·雅各布斯,《英语的发展——从新标准词典的45万词看英语的惊人发展》( New York Times),1913年11月16日。
Causes of language growth and change. It’s fun to speculate about the exact causes of language change, and about the future of the English language in particular. See Michael Erard, “English As She Will Be Spoke,” New Scientist, March 29, 2008; “English Is Coming,” Economist, February 12, 2009, online at http://goo.gl/wcPGt8. People have been interested in this sort of thing for a long time. See Joseph Jacobs, “Growth of English—Amazing Development of the Language as Shown in the New Standard Dictionary’s 450,000 Words,” New York Times, November 16, 1913.
爸爸,保姆从哪里来?
Daddy, where do babysitters come from?
二合一。有很多例子都体现了这种通过中间连字符从两个单词过渡到复合词的现象。我们不想用太多例子来让你感到困惑。例如,参见 NV:“rail road, rail-road, railroad”。
Two-to-one. There are many examples of this sort of transition from two words to a compound word by means of a hyphenated intermediate. We don’t want to railroad you with too many examples. See, for instance, NV: “rail road, rail-road, railroad.”
第四章 7.5分钟的名声
CHAPTER 4. 7.5 MINUTES OF FAME
别废话了
Cut the Crap
梵蒂冈秘密档案馆。 “Secret ”(秘密)一词指的是梵蒂冈秘密档案馆被视为教皇的个人财产。这并不是说这本书里并没有太多精彩内容,比如英国议会要求亨利八世离婚的照会、教皇下令将马丁·路德逐出教会,以及宣布“雌雄同体”的瑞典女王克里斯蒂娜退位的信函。好在近年来,大规模的编目工作让这本书更容易被找到。
Vatican Secret Archive. The word Secret—Segreto—refers to the fact that the Archivio Segreto Vaticano is regarded as the personal property of the pope. That’s not to say the place isn’t packed with juicy stuff, like a note from the English Parliament requesting a divorce for Henry VIII, the Papal Order excommunicating Martin Luther, and a letter announcing the abdication of the “hermaphrodite” Queen Christina of Sweden. Fortunately, in recent years, a massive cataloging effort has made its books a lot easier to find.
元数据质量。在信息丰富的博客“语言日志”(Language Log)上,你可以找到一个有趣但已经过时的帖子,探讨谷歌早期在图书元数据方面遇到的麻烦。请参阅Geoff Nunberg的《谷歌图书:一场元数据灾难》 ,“语言日志” ,2009年8月29日,http://goo.gl/AwNArh。自那时起,元数据质量已显著提高。
Metadata quality. An interesting, but now dated, thread on the early troubles Google encountered with book metadata can be found on the highly informative blog Language Log. See Geoff Nunberg, “Google Books: A Metadata Train Wreck,” Language Log, August 29, 2009, http://goo.gl/AwNArh. The metadata quality has been improved dramatically since that time.
用于提高元数据质量的算法。参见 Michel2011S。
Algorithms for improving metadata quality. See Michel2011S.
清洁先生
Mr. Clean
Ngrams 与人类基因组。基因组碱基调用质量的估计基于 Eric Lander 等人的论文《人类基因组的初始测序和分析》,《自然》 409,第 6822 期(2001 年):860–921,在线访问 http://goo.gl/trMZ4e。
Ngrams vs. the human genome. Estimates of genome base-call quality are based on Eric Lander et al., “Initial Sequencing and Analysis of the Human Genome,” Nature 409, no. 6822 (2001): 860–921, online at http://goo.gl/trMZ4e.
Ngrams 与法律。一个新兴的法律论点是,虽然提供数百万份受版权保护文本的数字副本供人们阅读(“消耗性”使用)构成侵犯版权,但允许查看使用相同受版权保护文本执行的计算输出(“非消耗性”使用)则不构成侵犯版权,只要输出不包含原文的大段内容。Ngrams 是书籍“非消耗性”使用的一个例子,我们在“作家协会等诉谷歌”案中向法院提交的一份法庭之友陈述中也提出了这一点。参见 Erez Lieberman-Aiden 和 Jean-Baptiste Michel 致法院的信函,2009 年 9 月 3 日(ECF 编号 303),美国作家协会等诉谷歌公司,770 F.Supp.2d 666(SDNY,2011 年 3 月 22 日)(编号 05-Civ.-8136)。
Ngrams and the law. One emerging legal argument is that, whereas providing digital copies of millions of copyrighted texts for people to read (“consumptive” use) is a breach of copyright, making it possible to see the output of computations performed using those same copyrighted texts (“nonconsumptive” uses) may not be, so long as the output doesn’t include long chunks of the original text. Ngrams are an example of a useful “nonconsumptive” use of books, a point we made in an amicus brief to the court in the case of The Authors Guild, Inc., et al., v. Google, Inc. See Letter from Erez Lieberman-Aiden and Jean-Baptiste Michel to Court, September 3, 2009 (ECF No. 303), The Authors Guild, Inc., et al., v. Google, Inc., 770 F.Supp.2d 666 (S.D.N.Y., March 22, 2011) (No. 05-Civ.-8136).
这一论点最近在The Authors Guild, Inc., et al. v. HathiTrust et al. (SDNY, 2012) 一案中获得了一定的法律支持。HathiTrust 数字图书馆可直接访问从参与图书馆获得的数百万本数字图书。这些图书通常由 Google 数字化。2012 年 10 月 10 日,纽约南区联邦地区法官 Harold Baer, Jr. 裁定 HathiTrust 胜诉。该裁决明确承认,对大量藏书进行“非消耗性”计算是对“科学进步和艺术培养的宝贵贡献”,并且此类收益“完全属于合理使用的保护范围”。为了支持这一观点,Baer 法官引用了 Matthew L. Jockers、Matthew Sag 和 Jason Schultz 提交的一份法庭之友陈述,我们也是该陈述的签署人;举一个具体的例子,法官引用了我们用来打开这本书的同一个词组:“作者使用‘is’而不是‘are’来指代美国的频率随时间的变化”。该裁决可在线访问 http://goo.gl/QESiv;它引用的法庭之友意见书是“数字人文和法律学者作为法庭之友部分支持被告的简易判决动议的意见书”,作者协会等诉HathiTrust 等,902 F.Supp.2d 445 (SDNY, 2012 年 10 月 10 日) (No. 11-Civ.-06351) 2012 WL 4808939。
This argument has gained some legal traction recently in the case of The Authors Guild, Inc., et al. v. HathiTrust et al. (S.D.N.Y., 2012). The HathiTrust Digital Library offers direct access to millions of digitized books obtained from participating libraries. Often these have been digitized by Google. On October 10, 2012, Hon. Harold Baer, Jr., a federal district judge in the Southern District of New York, ruled in favor of HathiTrust. The ruling specifically recognized that “nonconsumptive” computations over large collections of books constitute an “invaluable contribution to the progress of science and the cultivation of the arts” and that such benefits “fall safely within the protection of fair use.” To support this view, Judge Baer cited an amicus brief filed by Matthew L. Jockers, Matthew Sag, and Jason Schultz, on which we were also signatories; as a specific example, the judge referred to the same ngram we used to open this book: “the frequency with which authors used ‘is’ to refer to the United States rather than ‘are’ over time.” The ruling is online at http://goo.gl/QESiv; the amicus brief it cites is “Brief of Digital Humanities and Law Scholars as Amici Curiae in Partial Support of Defendants’ Motion for Summary Judgment,” The Authors Guild, Inc., et al., v. HathiTrust et al., 902 F.Supp.2d 445 (S.D.N.Y., October 10, 2012) (No. 11-Civ.-06351) 2012 WL 4808939.
名气能给你带来什么
What Fame Buys You
史蒂芬·平克。参见《科尔伯特报告》,视频,6:38,2007年2月7日,http://goo.gl/iFMGCt。平克是 Michel2011 的合著者。
Steven Pinker. See The Colbert Report, 6:38, February 7, 2007, http://goo.gl/iFMGCt. Pinker was a coauthor on Michel2011.
名声的故事
The Story of Fame
地球上被谷歌搜索次数最多的人。参见“时代精神2010:世界如何搜索”,谷歌时代精神,2011年,http://goo.gl/OCpY2X。
Most Googled person on Earth. See “Zeitgeist 2010: How the World Searched,” Google Zeitgeist, 2011, http://goo.gl/OCpY2X.
莱特的东西
The Wright Stuff
“当我看到它时,我就知道了。”当您看到 Jacobellis v. Ohio, 378 U.S. 184 (1964) 时,您就会知道。
“I know it when I see it.” You’ll know it when you see Jacobellis v. Ohio, 378 U.S. 184 (1964).
风洞。参见 Wilbur Wright 等人的《威尔伯和奥维尔·莱特的论文》(纽约:麦格劳希尔出版公司,2000 年);彼得·L·雅卡布 (Peter L. Jakab),《飞行机器的愿景:莱特兄弟与发明过程》(华盛顿特区:史密森学会出版社,1990 年);吉娜·哈格勒,《船舶和航天器建模:掌控海洋和天空的科学与艺术》(纽约:施普林格出版公司,2013 年)。
Wind tunnels. See Wilbur Wright et al., The Papers of Wilbur and Orville Wright (New York: McGraw-Hill, 2000); Peter L. Jakab, Visions of a Flying Machine: The Wright Brothers and the Process of Invention (Washington, DC: Smithsonian Institution Press, 1990); Gina Hagler, Modeling Ships and Space Craft: The Science and Art of Mastering the Oceans and Sky (New York: Springer, 2013).
《几近成名》
Almost Famous
迈克尔·斯蒂尔。该事件的视频刊登于Newsmax网站,题为《斯蒂尔在辩论中误提“最爱的书”》,2011年1月3日,http://goo.gl/8hh40。
Michael Steele. A video of the incident in question appears at “Steele Flubs ‘Favorite Book’ Reference During Debate,” Newsmax, January 3, 2011, http://goo.gl/8hh40.
卡罗尔·吉利根。参见安德拉·美狄亚著《卡罗尔·吉利根》,《犹太妇女:一部综合性历史百科全书》,http://goo.gl/LN2al。
Carol Gilligan. See Andra Medea, “Carol Gilligan,” Jewish Women: A Comprehensive Historical Encyclopaedia, http://goo.gl/LN2al.
像对待疾病一样对待名声
Treating Fame Like a Disease
队列研究方法。安德沃德 (Andvord) 1930 年原创研究的译文发表于 Kristian F. Andvord 的《通过追踪结核病的世代发展,我们能学到什么?》(国际结核病与肺部疾病杂志,第 6 卷,第 7 期 (2002),第 562-568 页)。有关经典队列研究的综述,请参阅 Richard Doll 的《队列研究:方法史》,《社会与预防医学》,第 46 卷,第 2 期 (2001),第 75-86 页,在线访问 http://goo.gl/dRJKCp。本章中的分析均基于 Michel2011 的研究,并在该研究和 Michel2011S 中进行了详细说明。
Cohort method. A translation of Andvord’s 1930 original research appears at Kristian F. Andvord, “What Can We Learn by Following the Development of Tuberculosis from One Generation to Another?” International Journal of Tuberculosis and Lung Disease 6, no. 7 (2002): 562–68. For a survey of classic cohort studies, see Richard Doll, “Cohort Studies: History of the Method,” Sozial- und Präventivmedizin 46, no. 2 (2001), 75–86, online at http://goo.gl/dRJKCp. The analyses in this chapter are all based on Michel2011 and are detailed there and in Michel2011S.
名人堂
The Hall of Fame
21758 Adrianveres。Adrian的母星的轨道周期为 3.47 个地球年。
21758 Adrianveres. Adrian’s homeworld has an orbital period of 3.47 Earth years.
7,500 名受害者。编制一份 1800 年至 1950 年间每年出生的 50 位最著名人物的名单涉及一系列重大技术障碍。其中一个主要问题是确定与某人姓名对应的 ngram 何时真正指的是该人。ngram Winston Churchill 最有可能指的是 1874 年出生的政治家、他 1940 年出生的孙子、同样名叫 Winston Churchill 且出生于 1871 年的小说家,还是这三者难以区分的混合?为了解决这个问题,Veres 使用了大量上下文信息,例如将每位温斯顿·丘吉尔候选人的生日与 ngram 首次出现的时间进行比较,注意到“Winston Churchill”的维基百科页面默认重定向到 Winston1874 的页面,并观察到 Winston1874 比其他温斯顿·丘吉尔候选人获得更多的维基百科流量。这些标准以及其他标准已应用于数十万个姓名。详情请参阅 Michel2011S。
7,500 victims. Compiling a list of the fifty most famous people born each year between 1800 and 1950 involved a series of significant technical hurdles. One major problem was deciding when mentions of an ngram corresponding to the name of a person were actually referring to that person. Does the ngram Winston Churchill most likely refer to the statesman born in 1874, to his grandson born in 1940, to a novelist also named Winston Churchill and born in 1871, or to an impossible-to-disentangle mix of the three? To solve this problem, Veres used a great deal of contextual information, such as comparing the birthday of each Winston Churchill candidate with the ngram debut, noting the fact that the Wikipedia page for “Winston Churchill” redirects by default to the page of Winston1874, and observing that Winston1874 gets much more Wikipedia traffic than the other Winston Churchill candidates. These criteria and others were applied to hundreds of thousands of names. Read all about it at Michel2011S.
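The kind of contextual filtering described in this note can be sketched in a few lines. This is only an illustration, not the procedure documented in Michel2011S: the selection rule, the field names, the debut year, and the traffic figures below are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    label: str                 # e.g. "Winston1874"
    birth_year: int
    is_default_redirect: bool  # does the bare-name Wikipedia page point here?
    monthly_views: int         # made-up traffic figure, for illustration only


def pick_referent(candidates, ngram_debut_year):
    """Return the most plausible referent of a name ngram, or None."""
    # A candidate born after the name first appears in print cannot explain that debut.
    viable = [c for c in candidates if c.birth_year < ngram_debut_year]
    if not viable:
        return None
    # Prefer the default Wikipedia target, then the page with more traffic.
    return max(viable, key=lambda c: (c.is_default_redirect, c.monthly_views))


# Hypothetical inputs: the debut year and view counts are NOT real measurements.
churchills = [
    Candidate("Winston1874", 1874, True, 500_000),
    Candidate("Winston1940", 1940, False, 20_000),
    Candidate("Winston1871", 1871, False, 8_000),
]
best = pick_referent(churchills, ngram_debut_year=1895)
print(best.label)
```

A real pipeline would need to weigh many more signals and handle names for which no single referent dominates; the point here is only that each criterion in the note can be expressed as a simple filter or ranking key.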
一群令人兴奋的人。后来,Veres 和《科学》 杂志记者 John Bohannon 利用 ngrams 构建了一个科学名人堂,囊括了当代被提及次数最多的科学家。他们以毫达尔文 (milliDarwin) 为单位计算了每位科学家的知名度。1 毫达尔文是达尔文知名度的千分之一。最著名的科学家竟然是伯特兰·罗素 (Bertrand Russell),他的反战立场使他备受争议。在世的科学家中,最著名的是诺姆·乔姆斯基 (Noam Chomsky),他的知名度为 507 毫达尔文。参见 Adrian Veres 和 John Bohannon 的《科学名人堂》,《科学》第 331 卷,第 6014 期(2011 年 1 月 14 日),在线访问 http://goo.gl/6g8b7X。
Exciting set of people. Later, Veres and Science journalist John Bohannon used ngrams to assemble a Science Hall of Fame comprising the most frequently mentioned contemporary scientists. They calculated the fame of each scientist in milliDarwins. One milliDarwin is one one-thousandth of the fame of Darwin. The most famous scientist turns out to be Bertrand Russell, whose antiwar positions made him the subject of great controversy. The most famous living scientist is Noam Chomsky, at 507 milliDarwins. See Adrian Veres and John Bohannon, “The Science Hall of Fame,” Science 331, no. 6014 (January 14, 2011), online at http://goo.gl/6g8b7X.
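The milliDarwin unit is simple arithmetic, and a short sketch may make it concrete. It assumes that fame is measured as a mention count (or frequency) over the same corpus and period for both people; the function name and the counts used below are hypothetical.

```python
def milli_darwins(mentions, darwin_mentions):
    """Express a mention count as thousandths of Darwin's mention count."""
    if darwin_mentions <= 0:
        raise ValueError("darwin_mentions must be positive")
    return 1000.0 * mentions / darwin_mentions


# Hypothetical counts: someone mentioned 507 times for every 1,000 mentions of Darwin.
score = milli_darwins(507, 1000)
print(score)
```

On this scale Darwin himself scores 1,000, so a value of 507 milliDarwins corresponds to roughly half of Darwin's fame.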
大统一理论
The Grandee Unified Theory
名声的动态。参见 Michel2011、Michel2011S。
The dynamics of fame. See Michel2011, Michel2011S.
如何成名:职业选择指南
How to Get Famous: A Guide to Choosing Your Career
名人焦点小组。 1800年至1920年间出生的各职业类别中最著名25位人物的完整名单可在Michel2011S上查阅。名单包括玛丽·居里(1867年,科学家)、马塞尔·杜尚(1887年,艺术家)、克劳德·香农(1916年,数学家)、汉弗莱·鲍嘉(1899年,演员)、弗吉尼亚·伍尔夫(1882年,作家)和温斯顿·丘吉尔(1874年,政治家)。
Celebrity focus group. The list of the twenty-five most famous people born between 1800 and 1920 in each of the career categories can be consulted in its entirety at Michel2011S. The list features Marie Curie (1867, scientist), Marcel Duchamp (1887, artist), Claude Shannon (1916, mathematician), Humphrey Bogart (1899, actor), Virginia Woolf (1882, author), and Winston Churchill (1874, politician).
关于名声。名声研究是社会学中一个成熟的领域。参见Leo Braudy著《名声的狂热:名声及其历史》(牛津:牛津大学出版社,1986年);Stanley Lieberson著《品味问题:姓名、时尚和文化如何变迁》(康涅狄格州纽黑文:耶鲁大学出版社,2000年)。
On fame. The study of fame is a well-established field of sociology. See Leo Braudy, The Frenzy of Renown: Fame and Its History (Oxford: Oxford University Press, 1986); Stanley Lieberson, A Matter of Taste: How Names, Fashions, and Culture Change (New Haven, CT: Yale University Press, 2000).
骂名
Infamy
查普曼的话。参见马克·塞奇,《查普曼枪杀列侬,‘窃取他的名声’》 , 《爱尔兰观察家报》 ,2004年10月19日,在线版,http://goo.gl/pLXl51。最近, 《滚石》杂志将波士顿马拉松爆炸案凶手之一焦哈尔·察尔纳耶夫的肖像刊登在封面上,引发了一场相关的争议。参见珍妮特·雷特曼,《贾哈尔的世界》,《滚石》,2013年7月17日,http://goo.gl/fyc8y。
Chapman’s words. See Mark Sage, “Chapman Shot Lennon to ‘Steal His Fame,’” Irish Examiner, October 19, 2004, online at http://goo.gl/pLXl51. A related controversy recently arose after Rolling Stone put a portrait of one of the Boston Marathon bombers, Dzhokhar Tsarnaev, on its cover. See Janet Reitman, “Jahar’s World,” Rolling Stone, July 17, 2013, http://goo.gl/fyc8y.
人类的一次巨大飞跃
One giant leapfrog for mankind
美国英雄。 如果你知道执行这次任务的第三名宇航员——阿姆斯特朗和奥尔德林在月球表面时,他乘坐指令舱绕月飞行——名叫迈克尔·柯林斯,请举手。
American heroes. Raise your hand if you knew that the third astronaut on the mission—who orbited the moon in the command module while Armstrong and Aldrin were on the surface—was named Michael Collins.
第五章 寂静之声
CHAPTER 5. THE SOUND OF SILENCE
简介
Intro
“Dort wo man Bücher verbrennt.” 请参阅海因里希·海涅 (Heinrich Heine),《阿尔曼索尔》(Almansor),载于卡尔·阿道夫·布赫海姆 (Carl Adolf Buchheim) 编的《海因里希·海涅全集》(Heinrich Heine’s Gesammelte Werke)(柏林:G. Grote,1887 年);译文改编自斯蒂芬·J·惠特菲尔德 (Stephen J. Whitfield) 的《他们焚书的地方》,《现代犹太教》22,第 3 期(2002 年):213–33,在线网址为 http://goo.gl/YbmMU3。今天,这段话出现在米哈·乌尔曼 (Micha Ullman) 设计的一座纪念碑上,纪念碑位于柏林的公共广场倍倍尔广场。1933 年焚书事件中,约瑟夫·戈培尔带领暴徒在这里焚烧了两万多本书。纪念碑是广场地面上的一块半透明窗格,透过它,旁观者可以看到足以容纳两万本书的空书架。你可以在 http://goo.gl/SYzu4 看到碑文。请注意,倍倍尔广场铭文中出现的《阿尔曼索尔》段落含有一处排印错误。
“Dort wo man Bücher verbrennt.” See Heinrich Heine, Almansor, in Heinrich Heine’s Gesammelte Werke, ed. Carl Adolf Buchheim (Berlin: G. Grote, 1887); translation adapted from Stephen J. Whitfield, “Where They Burn Books,” Modern Judaism 22, no. 3 (2002): 213–33, online at http://goo.gl/YbmMU3. Today, this passage appears in a memorial designed by Micha Ullman at Bebelplatz, a public square in Berlin, on the site where, during the 1933 book burnings, Joseph Goebbels led a mob in burning more than twenty thousand books. The memorial is a translucent pane in the square, through which onlookers can see enough empty bookshelves to accommodate twenty thousand books. You can see the inscription at http://goo.gl/SYzu4. Note that the passage from Almansor, as it appears in the Bebelplatz inscription, contains a typographic error.
海伦·凯勒的信。这封信的抄本,经凯勒的一位助手修改,让我们得以一窥最终版本的编辑过程。该抄本现收藏于美国盲人基金会,详情可参阅海伦·塞尔斯登的《海伦·凯勒的话语:80年后……依然如此有力》,美国盲人基金会,2013年5月9日,http://goo.gl/uSSE8。
Helen Keller’s letter. A transcript of the letter, with changes written in by one of Keller’s aides, gives insight into the editing process that led to the final version. It is in the collections of the American Foundation for the Blind, and can be seen at Helen Selsdon, “Helen Keller’s Words: 80 Years Later . . . Still as Powerful,” American Foundation for the Blind, May 9, 2013, http://goo.gl/uSSE8.
这些注释已在 Rebecca Onion 的《上帝不沉睡:海伦·凯勒写给焚书的德国学生的一封措辞严厉的信》一文中讨论过,Slate 杂志,2013 年 5 月 16 日,http://goo.gl/SxdG2。
The annotations are discussed at Rebecca Onion, “‘God Sleepeth Not’: Helen Keller’s Blistering Letter to Book-Burning German Students,” Slate, May 16, 2013, http://goo.gl/SxdG2.
审查制度。参见V. Gregorian编,《审查制度:500年的冲突》(纽约:纽约公共图书馆,1984年)。
Censorship. See V. Gregorian, ed., Censorship: 500 Years of Conflict (New York: New York Public Library, 1984).
彩色玻璃窗
A Stained-Glass Window
“去找一本书。”参见 Jacob Baal-Teshuva,《夏加尔:1887–1985》(德国科隆:Taschen,2003 年),第 16 页。
“Go and find a book.” See Jacob Baal-Teshuva, Chagall: 1887–1985 (Cologne, Germany: Taschen, 2003), 16.
Móyshe Shagal。虽然他最终采用的名字 Marc Chagall 到 1910 年就已广为人知,但他早期曾使用过许多其他名字:Movsha Khatselev、Mark Zakharovich 和 Movsha Shagalov。请参阅 Benjamin Harshav 的《马克·夏加尔和他的时代:纪实叙事》(加州帕洛阿尔托:斯坦福大学出版社,2004 年),第 63 页。有关他的生活和艺术的有用书籍包括上述的 Baal-Teshuva;Jackie Wullschlager 的《夏加尔:传记》(纽约:Alfred A. Knopf,2008 年);马克·夏加尔的《耶路撒冷之窗》,Jean Leymarie 译(纽约:George Braziller,1967 年);马克·夏加尔的《我的生活》,Elisabeth Abbott 译(纽约:Da Capo Press,1994 年)。
Móyshe Shagal. Although the name he ultimately adopted, Marc Chagall, was well established by 1910, he was known by many other names early on: Movsha Khatselev, Mark Zakharovich, Movsha Shagalov. See Benjamin Harshav, Marc Chagall and His Times: A Documentary Narrative (Palo Alto, CA: Stanford University Press, 2004), 63. Useful volumes about his life and art include Baal-Teshuva, above; Jackie Wullschlager, Chagall: A Biography (New York: Alfred A. Knopf, 2008); Marc Chagall, The Jerusalem Windows, trans. Jean Leymarie (New York: George Braziller, 1967); Marc Chagall, My Life, trans. Elisabeth Abbott (New York: Da Capo Press, 1994).
“典型的犹太艺术家。”参见罗伯特·休斯的《现代主义屋顶上的提琴手》,《时代》杂志,2001 年 6 月 24 日,http://goo.gl/aFMsU。
“The quintessential Jewish artist.” See Robert Hughes, “Fiddler on the Roof of Modernism,” Time, June 24, 2001, http://goo.gl/aFMsU.
“当马蒂斯去世时。”参见弗朗索瓦丝·吉洛和卡尔顿·莱克合著《与毕加索的生活》(纽约:麦格劳-希尔,1964年),第258页。吉洛是毕加索的爱人和缪斯。她指出,尽管毕加索与夏加尔之间存在一些私人恩怨,但他仍然对夏加尔的艺术怀有无比的敬意。完整引文如下: “当马蒂斯去世时,夏加尔将是唯一一位真正理解色彩的画家。我并不热衷于那些鸡巴、屁股、飞翔的小提琴手以及所有民间传说,但他的画布是真正用心描绘的,而非随意拼凑而成。他最后在威尼斯创作的一些作品让我相信,自雷诺阿以来,再无人能像夏加尔那样对光线有如此敏锐的感知。”
“When Matisse dies.” See Françoise Gilot and Carlton Lake, Life with Picasso (New York: McGraw-Hill, 1964), 258. Gilot was Picasso’s lover and muse. She notes that, although Picasso had some personal issues with Chagall, he nonetheless had immense respect for Chagall’s art. The full quotation is: “When Matisse dies, Chagall will be the only painter left who understands what color really is. I’m not crazy about those cocks and asses and flying violinists and all the folklore, but his canvases are really painted, not just thrown together. Some of the last things he’s done in Venice convince me that there’s never been anybody since Renoir who has the feeling for light that Chagall has.”
视觉艺术委员。参见Wullschlager,第223页。
Commissar for the visual arts. See Wullschlager, 223.
“我担心我的‘形象’。 ”参见 Harshav,326–27。
“I’m afraid that my ‘image.’” See Harshav, 326–27.
图表。NV:“Chagall”/法语,“”/俄语。
Chart. NV: “Chagall”/French, “”/Russian.
堕落艺术
Degenerate Art
马克斯·诺尔道。他对堕落艺术的看法体现在两卷本的《堕落》(Entartung,柏林:Carl Duncker Verlag,1892-1893年)。纳粹对这一概念的使用显然与诺尔道更广泛的观点截然相反。例如,参见马克斯·诺尔道和古斯塔夫·戈塞尔合著的《犹太复国主义与反犹太主义》(纽约:Fox, Duffield,1905年);马克斯·诺尔道和安娜·诺尔道合著的《马克斯·诺尔道:传记》(蒙大拿州怀特菲什:凯辛格出版社,2007年)。诺尔道是前六届世界犹太复国主义大会的副主席(西奥多·赫茨尔为主席),也是接下来四届大会的主席。参见“马克斯·诺尔道”条目,载于斯宾塞·C·塔克编的《阿以冲突百科全书》(加州圣巴巴拉:ABC-CLIO出版社,2008年)。
Max Nordau. His views on degenerate art appear in the two-volume Entartung [Degeneration] (Berlin: Carl Duncker Verlag, 1892–1893). The Nazi usage of this concept was obviously a 180-degree reversal of Nordau’s broader views. See, for instance, Max Nordau and Gustav Gottheil, Zionism and Anti-Semitism (New York: Fox, Duffield, 1905); Max Nordau and Anna Nordau, Max Nordau: A Biography (Whitefish, MT: Kessinger, 2007). Nordau was vice president of the first six World Zionist Congresses (Theodor Herzl was president), and president of the next four. See “Max Nordau,” The Encyclopedia of the Arab-Israeli Conflict, ed. Spencer C. Tucker (Santa Barbara, CA: ABC-CLIO, 2008).
对德国文化的严酷管控。参见理查德·A·埃特林,《第三帝国下的艺术、文化与媒体》 (芝加哥:芝加哥大学出版社,2002年);格伦·R·库莫主编, 《国家社会主义文化政策》(纽约:圣马丁出版社,1995年);艾伦·E·施泰因韦斯,《纳粹德国的艺术、意识形态与经济学》(教堂山:北卡罗来纳大学出版社,1993年);乔纳森·佩特罗普洛斯,《浮士德式的交易》(纽约:牛津大学出版社,2000年)。
Draconian control of German culture. See Richard A. Etlin, Art, Culture, and Media Under the Third Reich (Chicago: University of Chicago Press, 2002); Glenn R. Cuomo, ed., National Socialist Cultural Policy (New York: St. Martin’s Press, 1995); Alan E. Steinweis, Art, Ideology, and Economics in Nazi Germany (Chapel Hill: University of North Carolina Press, 1993); Jonathan Petropoulos, The Faustian Bargain (New York: Oxford University Press, 2000).
“在未来,只有那些。”彼得·亚当,《第三帝国的艺术》(纽约:哈里·N·艾布拉姆斯,1992 年),53。
“In the future, only those.” Peter Adam, Art of the Third Reich (New York: Harry N. Abrams, 1992), 53.
《呐喊》。博物馆没有同意。参见马西·奥斯特,《纳粹掠夺的《呐喊》所有者继承人要求现代艺术博物馆就其展出作出解释》,犹太电讯社,2012年10月15日,http://goo.gl/gBmtL。
The Scream. The museum did not agree to do so. See Marcy Oster, “Heirs of Owner of Nazi-Looted ‘The Scream’ Want Explanation on Display at MoMA,” Jewish Telegraphic Agency, October 15, 2012, http://goo.gl/gBmtL.
史上最受欢迎的艺术展
The Most Popular Art Exhibit of All Time
“德国人民。” 译文取自尼尔·列维 (Neil Levi) 的《‘自己判断!’——作为政治奇观的‘堕落艺术’展览》,《十月》第 85 卷(1998 年):41–64 页,在线网址为 http://goo.gl/CfuBMt。
“German Volk.” The translation is drawn from Neil Levi, “‘Judge for Yourselves!’—The ‘Degenerate Art’ Exhibition as Political Spectacle,” October 85 (1998): 41–64, online at http://goo.gl/CfuBMt.
堕落的艺术。 1991年,斯蒂芬妮·巴伦(Stephanie Barron)策划了《堕落的艺术》 的重建展,并在洛杉矶郡立艺术博物馆展出。她为此次展览创作的图录是一项宝贵的学术贡献。参见斯蒂芬妮·巴伦主编,《堕落的艺术:纳粹德国先锋派的命运》(洛杉矶:洛杉矶郡立艺术博物馆,1991年)。
Entartete Kunst. In 1991, Stephanie Barron curated a reconstruction of Entartete Kunst for an exhibition at the Los Angeles County Museum of Art. The catalog she created for this exhibition is an invaluable scholarly contribution. See Stephanie Barron, ed., Degenerate Art: The Fate of the Avant-garde in Nazi Germany (Los Angeles: Los Angeles County Museum of Art, 1991).
“我感到一种压倒性的幽闭恐惧感。”这句话出自彼得·冈瑟(Peter Guenther)的散文《慕尼黑的三天,1937年7月》(Three Days in Munich, July 1937),收录于巴伦编写的展览图录。这份引人入胜的文献记录了冈瑟17岁时参观“大德意志艺术展”(Große Deutsche Kunstausstellung)和“堕落艺术展”(Entartete Kunst)的经历。参见同上,第38页。
“I felt an overwhelming sense of claustrophobia.” The quote is from “Three Days in Munich, July 1937,” an essay by Peter Guenther that appears in Barron’s catalog. This fascinating document describes Guenther’s visits to the Große Deutsche Kunstausstellung and Entartete Kunst as a seventeen-year-old. See ibid., 38.
最受欢迎的艺术展览。仅在1937年8月2日一天,就有三万六千人参观了“堕落艺术展”(Entartete Kunst)。为了了解这一参观人数的规模,可以参考《艺术新闻报》(Art Newspaper,www.theartnewspaper.com)提供的过去十年全球展览参观人数统计数据。20XX年的统计数据可在http://www.theartnewspaper.com/attfig/attfigXX.pdf上查阅。值得注意的是,在所列展览中,只有一个展览的日均参观人数超过了“堕落艺术展”开幕后前四个月的日均参观人数。这个例外是 2009 年在日本奈良举办的圣武天皇(701-756 年)和光明皇后(701-760 年)正仓院宝藏展,该展览的日均参观人数达到 17,926 人。然而,该展览只展出了大约两周,因此总参观人数略多于 25 万,只是“堕落艺术展”参观人数的一小部分。一般来说,有些展览在极短的时间内参观人数非常多,但没有一个能与“堕落艺术展”所获得的持续关注相提并论。Barron, 9 明确指出“任何其他现代艺术展都无法与‘堕落艺术展’的受欢迎程度相提并论”;虽然我们显然没有历史上每个艺术展的参观人数数据,但根据现有数据,我们觉得这种说法很有道理。
Most popular art exhibition. On August 2, 1937, alone, thirty-six thousand people attended Entartete Kunst. To give a sense of how massive this turnout was, it’s useful to examine worldwide exhibition attendance statistics that are conveniently available from the Art Newspaper (www.theartnewspaper.com) for the past ten years. The statistics for 20XX are available at http://www.theartnewspaper.com/attfig/attfigXX.pdf. Notably, only one of the exhibits listed exceeded the daily average attendance of Entartete Kunst over the latter’s first four months. The exception was a 2009 exhibition of the Shoso-In treasures of Emperor Shōmu (701–756) and Empress Kōmyō (701–760) in Nara, Japan, which sustained an average daily attendance of 17,926 people. However, the exhibit was only up for about two weeks, and thus total attendance, at a little more than a quarter million people, was a small fraction of the attendance of Entartete Kunst. In general, there are some shows with very high attendance over an extremely brief period of time, but none that comes close to matching the sustained interest achieved by Entartete Kunst. The claim that “the popularity of Entartete Kunst has never been matched by any other exhibition of modern art” is explicitly made in Barron, 9; although we obviously do not have attendance figures for every art exhibition in history, this claim seems very plausible to us based on available figures.
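The Nara figure quoted in this note is easy to check. The sketch below takes "about two weeks" to mean exactly 14 days, which is an assumption made for the arithmetic, and the function name is hypothetical.

```python
def total_attendance(daily_average, days_open):
    """Total visitors implied by an average daily attendance."""
    return daily_average * days_open


# 17,926 visitors per day comes from the note; 14 days is an assumed run length.
nara_total = total_attendance(17_926, 14)
print(nara_total)
```

Under that assumption the total comes to just over 250,000, which matches the note's "a little more than a quarter million people."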
埃米尔·诺尔德。诺尔德是纳粹党的支持者,但由于希特勒拒绝表现主义,他成为了攻击目标。
Emil Nolde. Nolde was a supporter of the Nazi Party, but was nevertheless a target because of Hitler’s rejection of Expressionism.
焚书
Book Burnings
海报。海报可以在http://goo.gl/bNK9H上看到。
Posters. The poster can be seen at http://goo.gl/bNK9H.
“我们想要尊重。”此翻译来自《1932-1939 年禁书清单》,亚利桑那大学,2002 年 6 月 22 日,http://goo.gl/PMVRy。
“We want to regard.” This translation is from “List of Banned Books, 1932–1939,” University of Arizona, June 22, 2002, http://goo.gl/PMVRy.
黑名单。黑名单的详细信息参见 W. Treß, Wider den Undeutschen Geist: Bücherverbrennung 1933 (Berlin: Parthas, 2003); G. Sauder, Die Bücherverbrennung:10。Mai 1933(法兰克福:Ullstein,1985);和Liste des Schädlichen und Unerwünschten Schrifttums(莱比锡:Hedrich,1938 年)。
Blacklists. The blacklists are detailed in W. Treß, Wider den Undeutschen Geist: Bücherverbrennung 1933 (Berlin: Parthas, 2003); G. Sauder, Die Bücherverbrennung: 10. Mai 1933 (Frankfurt am Main: Ullstein, 1985); and Liste des Schädlichen und Unerwünschten Schrifttums (Leipzig: Hedrich, 1938).
与W. Treß以及柏林市政府网站(berlin.de)的沟通,为我们创建黑名单的数字版本提供了巨大的帮助。http://goo.gl/0ig7Ig上有一个非常有用的时间线。
Communications with W. Treß and the City of Berlin Web site (berlin.de) provided us with immense assistance in creating digital versions of the blacklists. A very helpful timeline appears at http://goo.gl/0ig7Ig.
玛格丽特·斯蒂格·道尔顿的作品。参见 Margaret F. Stieg,《纳粹德国的公共图书馆》(塔斯卡卢萨:阿拉巴马大学出版社,1992 年)和 Alan E. Steinweis,《纳粹德国公共图书馆评论》,作者:Margaret F. Stieg,DigitalCommons@内布拉斯加大学林肯分校,1992 年 4 月 1 日,http://goo.gl/atlK2t。
The work of Margaret Stieg Dalton. See Margaret F. Stieg, Public Libraries in Nazi Germany (Tuscaloosa: University of Alabama Press, 1992) and Alan E. Steinweis, review of Public Libraries in Nazi Germany, by Margaret F. Stieg, DigitalCommons@University of Nebraska-Lincoln, April 1, 1992, http://goo.gl/atlK2t.
他们不想让你知道的事:世界巡演
What They Don’t Want You to Know: A World Tour
俄罗斯的镇压。参见罗伯特·瑟维斯,《斯大林:传记》(马萨诸塞州剑桥:哈佛大学出版社,2004年)。斯大林不仅设法将对手从文字记录中抹去。例如,他还非常积极地将对手从照片中抹去。参见大卫·金,《消失的政委》(纽约:大都会图书出版社,1997年);约瑟夫·吉布斯,《戈尔巴乔夫的公开性》(大学城:德克萨斯农工大学出版社,1999年)。
Suppression in Russia. See Robert Service, Stalin: A Biography (Cambridge, MA: Harvard University Press, 2004). Stalin didn’t just manage to edit rivals out of the textual record. He was also, for instance, very aggressive about having his rivals doctored out of photographs. See David King, The Commissar Vanishes (New York: Metropolitan Books, 1997); Joseph Gibbs, Gorbachev’s Glasnost (College Station: Texas A&M University Press, 1999).
NV:“ ,,”/俄语(平滑度 = 1)。
NV: “, , ”/Russian (smoothing = 1).
NV:“Tiananmen”/英文,“ ”/中文(平滑度 = 0)。请注意,坐标轴的尺度不同;确切的查询是:“Tiananmen:eng_2012 * 10, :chi_sim_2012”。1950 年之前的虚假峰值是由于中文语料库中该日期之前撰写的书籍数量较少。中文资料倾向于将这些事件称为“六四事件”。事实上,NV:“ ”/中文在预期时间出现了上升;然而,这并不奇怪,因为这个短语在 1989 年之前没有指称。
NV: “Tiananmen”/English, “”/Chinese (smoothing = 0). Note that the axes are on different scales; the exact query is: “Tiananmen:eng_2012 * 10, :chi_sim_2012.” Spurious peaks prior to 1950 are due to the small number of books written before that date in the Chinese corpus. Chinese sources tend to refer to these events as “the June 4th Incident,” . Indeed, NV: “ ”/Chinese shows a rise at the expected time; however, this is not surprising, given that this phrase has no referent before 1989.
好莱坞十人。有关好莱坞十人的肖像,请参阅伯纳德·F·迪克的《彻底的纯真》(列克星敦:肯塔基大学出版社,1988年);杰拉尔德·霍恩的《黑名单的最后受害者》(伯克利:加州大学出版社,2006年);爱德华·德米特里克的自传《奇人》 (卡本代尔:南伊利诺伊大学出版社,1996年);以及约翰·贝里执导于1950年的精彩纪录片《好莱坞十人》 。
The Hollywood Ten. For portraits of the Hollywood Ten, see Bernard F. Dick, Radical Innocence (Lexington: University Press of Kentucky, 1988); Gerald Horne, The Final Victim of the Blacklist (Berkeley: University of California Press, 2006); the autobiographical Edward Dmytryk, Odd Man Out (Carbondale: Southern Illinois University Press, 1996); and the remarkable documentary film The Hollywood Ten, directed by John Berry, 1950.
“直到他被宣判无罪为止。”《沃尔多夫声明》全文载于威廉·T·沃克,《麦卡锡主义与红色恐慌》(加州圣巴巴拉:ABC-CLIO,2011 年),第 136 页。
“Until such time as he is acquitted.” The full text of the Waldorf Statement appears in William T. Walker, McCarthyism and the Red Scare (Santa Barbara, CA: ABC-CLIO, 2011), 136.
“当今美国最不美国化的东西。”参见乔纳森·奥尔巴赫,《黑暗边界》(北卡罗来纳州达勒姆:杜克大学出版社,2011年),第4页。
“Most un-American thing in the country today.” See Jonathan Auerbach, Dark Borders (Durham, NC: Duke University Press, 2011), 4.
出埃及记。参见奥托·普雷明格执导的《出埃及记》 ,1960年。
Exodus. See Exodus, directed by Otto Preminger, 1960.
天安门广场。有关这场屠杀的更多信息,请参阅赵鼎新著《天安门的力量》(芝加哥:芝加哥大学出版社,2001年);斯科特·西米和鲍勃·尼克松著《天安门广场》(西雅图:华盛顿大学出版社,1990年);菲利普·J·坎宁安著《天安门之月》(马里兰州拉纳姆:罗曼与利特尔菲尔德出版社,2009年);蒂莫西·布鲁克著《平息人民》(加州帕洛阿尔托:斯坦福大学出版社,1992年)。
Tiananmen Square. For more about the massacre, see Dingxin Zhao, The Power of Tiananmen (Chicago: University of Chicago Press, 2001); Scott Simmie and Bob Nixon, Tiananmen Square (Seattle: University of Washington Press, 1990); Philip J. Cunningham, Tiananmen Moon (Lanham, MD: Rowman & Littlefield, 2009); Timothy Brook, Quelling the People (Palo Alto, CA: Stanford University Press, 1992).
“中国的防火长城”。参见萧强和索菲·比奇合著的《中国的防火长城》,《圣彼得堡时报》 ,2002 年 9 月 3 日;以及《防火长城:隐藏的艺术》,《经济学人》 ,2013 年 4 月 6 日,http://goo.gl/VTV3b。
“Great Firewall of China.” See Xiao Qiang and Sophie Beach, “The Great Firewall of China,” St. Petersburg Times, September 3, 2002; “The Great Firewall: The Art of Concealment,” Economist, April 6, 2013, http://goo.gl/VTV3b.
中国对谷歌等搜索引擎的审查,在某种程度上,让我们回想起了索引或卡片目录的概念。如果你无法清除图书馆的内容(在这个比喻中,即关闭整个互联网),你可以通过删除索引或卡片目录(帮助你找到感兴趣的页面或词语的搜索引擎)来有效地限制访问。更多关于中国对谷歌的审查以及谷歌自身的审查,请参阅《谷歌为中国自我审查》,英国广播公司(BBC),2006年1月25日,http://goo.gl/Xyd1ua;迈克尔·怀恩斯(Michael Wines),《谷歌将提醒用户注意中国审查》,《纽约时报》,2012年6月1日,http://goo.gl/7QmrQ;乔什·哈利迪(Josh Halliday),《谷歌撤回反审查警告,标志着在中国悄然败退》,《卫报》,2013年1月7日,http://goo.gl/aA2HU。
The Chinese effort to censor search engines such as Google brings us back, in some ways, to the notion of a concordance or card catalog. If you can’t get rid of the contents of the library (in this analogy, by shutting down the entire Internet), you can effectively restrict access by eliminating the concordance or the card catalog (search engines that help you find the page or word you’re interested in). For more about censorship of and by Google in China, see “Google Censors Itself for China,” BBC, January 25, 2006, http://goo.gl/Xyd1ua; Michael Wines, “Google to Alert Users to Chinese Censorship,” New York Times, June 1, 2012, http://goo.gl/7QmrQ; Josh Halliday, “Google’s Dropped Anti-Censorship Warning Marks Quiet Defeat in China,” Guardian, January 7, 2013, http://goo.gl/aA2HU.
欲了解更多关于中国对天安门广场大屠杀的互联网审查,请参阅乔纳森·凯曼(Jonathan Kaiman)的文章《中国当局审查天安门广场在线搜索》,《卫报》,2013年6月4日,http://goo.gl/60SIo;马特·斯基亚文扎(Matt Schiavenza)的文章《中国如何让天安门广场大屠杀变得无关紧要》,《大西洋月刊》,2013年6月4日,http://goo.gl/d7Ccw。欲了解更多关于坦克人的信息,请参阅帕特里克·威蒂(Patrick Witty)的文章《幕后:天安门坦克人》,《纽约时报》,2009年6月3日,http://goo.gl/IvhdX。
For more about Chinese Internet censorship of the Tiananmen Square massacre, see Jonathan Kaiman, “Tiananmen Square Online Searches Censored by Chinese Authorities,” Guardian, June 4, 2013, http://goo.gl/60SIo; Matt Schiavenza, “How China Made the Tiananmen Square Massacre Irrelevant,” Atlantic, June 4, 2013, http://goo.gl/d7Ccw. For more about the Tank Man, see Patrick Witty, “Behind the Scenes: Tank Man of Tiananmen,” New York Times, June 3, 2009, http://goo.gl/IvhdX.
或许最有说服力的见解来自于询问中国年轻一代对这一事件的了解程度、他们何时得知这一事件以及如何得知这一事件,例如“中国天安门一代的发言”,BBC,2009 年 5 月 28 日,http://goo.gl/ms7x2,以及“中国学生不知道有‘坦克人’”,前线,视频,2:37,2008 年 7 月 27 日,http://goo.gl/Jf0Hy。
Perhaps the most telling insights come from asking members of younger generations in China what they know about the incident, when they learned about it, and how they found out, as in “China’s Tiananmen Generation Speaks,” BBC, May 28, 2009, http://goo.gl/ms7x2, and in “Chinese Students Unaware of the ‘Tank Man,’” Frontline, video, 2:37, July 27, 2008, http://goo.gl/Jf0Hy.
我们能自动检测审查吗?
Can We Detect Censorship Automatically?
检测审查。详情请参阅 Michel2011 和 Michel2011S。
Detecting censorship. See Michel2011 and Michel2011S for details.
渗透百万渠道
Seeping Through a Million Channels
嘲讽也可能是一种营销手段。纳粹在“堕落艺术”(Entartete Kunst)展览之后,又举办了爵士乐、犹太歌曲和其他堕落音乐(Entartete Musik)的音乐会,他们对这种颠覆形式愈发担忧,怀疑现场听众其实是这些音乐的粉丝。参见迈克尔·哈斯(Michael Haas)著《禁忌音乐》(Forbidden Music)(康涅狄格州纽黑文:耶鲁大学出版社,2013年);“第三帝国的音乐”,《音乐与大屠杀》,http://goo.gl/OlNcwZ。
Mocking can be marketing. When the Nazis followed up the Entartete Kunst exhibit with concerts featuring jazz, Jewish songs, and other entartete Musik, they became increasingly worried about this form of subversion, suspecting that the listeners in attendance were actually coming because they were fans of the music. See Michael Haas, Forbidden Music (New Haven, CT: Yale University Press, 2013); “Music in the Third Reich,” Music and the Holocaust, http://goo.gl/OlNcwZ.
夏洛特·萨洛蒙。参见夏洛特·萨洛蒙,《生活?还是戏剧?》(Leila Vennewitz 译,纽约:Viking出版社,1981年);玛丽·洛文塔尔·费尔斯蒂纳 (Mary Lowenthal Felstiner),《描绘她的人生》(纽约:Harper Perennial出版社,1995年);迈克尔·P·斯坦伯格和莫妮卡·博姆-杜钦,《阅读夏洛特·萨洛蒙》(纽约州伊萨卡:康奈尔大学出版社,2006年)。
Charlotte Salomon. See Charlotte Salomon, Life? or Theater?, trans. Leila Vennewitz (New York: Viking, 1981); Mary Lowenthal Felstiner, To Paint Her Life (New York: Harper Perennial, 1995); Michael P. Steinberg and Monica Bohm-Duchen, Reading Charlotte Salomon (Ithaca, NY: Cornell University Press, 2006).
“安妮·弗兰克日记的图画版。”参见《对生命价值的深刻警示》,《圣彼得堡时报》,1963年10月6日。
“The pictorial counterpart of Anne Frank’s diary.” See “A Poignant Reminder of the Value of Life,” St. Petersburg Times, October 6, 1963.
“如此温柔。”这句话出自夏洛特的继母保拉·萨洛蒙-林德伯格之口。参见费尔斯蒂纳,第228页。
“So tenderly.” The quote is from Paula Salomon-Lindberg, Charlotte’s stepmother. See Felstiner, 228.
两项权利构成另一项权利
Two rights make another right
这个 ngram最初由 Steven Pinker 指出,并在 Steven Pinker 的《人性中的善良天使:暴力为何减少》(纽约:Viking,2011 年)中进行了更详细的讨论。
This ngram was originally pointed out by Steven Pinker, and is discussed in greater detail in Steven Pinker, The Better Angels of Our Nature: Why Violence Has Declined (New York: Viking, 2011).
第六章 记忆的持久性
CHAPTER 6. THE PERSISTENCE OF MEMORY
简介
Intro
维也纳学派。参见托马斯·于贝尔,《维也纳学派》,《斯坦福哲学百科全书》 (2012年夏季出版);阿尔弗雷德·J·艾耶尔,《逻辑实证主义》(伊利诺斯州格伦科:自由出版社,1959年出版);弗里德里希·魏斯曼等,《维特根斯坦与维也纳学派》(牛津:巴兹尔·布莱克威尔出版社,1979年出版);以及大卫·埃德蒙兹和约翰·艾迪诺,《维特根斯坦的扑克牌》(纽约:爱可出版社,2001年出版)。
Vienna Circle. See Thomas Uebel, “Vienna Circle,” The Stanford Encyclopedia of Philosophy (Summer 2012); Alfred J. Ayer, Logical Positivism (Glencoe, IL: Free Press, 1959); Friedrich Waismann et al., Wittgenstein and the Vienna Circle (Oxford: Basil Blackwell, 1979); and David Edmonds and John Eidinow, Wittgenstein’s Poker (New York: Ecco, 2001).
反对 “民族精神”一词。参见 Verein Ernst Mach,《Wissenschaftliche Weltauffassung:Der Wiener Kreis》(维也纳:Artur Wolf,1929 年)。
Opposition to the term Volksgeist. See Verein Ernst Mach, Wissenschaftliche Weltauffassung: Der Wiener Kreis (Vienna: Artur Wolf, 1929).
记忆测试
Memory Test
艾宾浩斯。参见赫尔曼·艾宾浩斯著《记忆:对实验心理学的贡献》(Hermann Ebbinghaus, Memory: A Contribution to Experimental Psychology),亨利·鲁格和克拉拉·布森尼斯译(1885年;纽约:哥伦比亚大学教师学院,1913年)。威廉·詹姆斯对这部著作的溢美之词可参见威廉·詹姆斯的《散文、评论与综述》(William James, Essays, Comments and Reviews)(马萨诸塞州剑桥:哈佛大学出版社,1987年)。艾宾浩斯虽然是实验心理学的先驱,但他并不属于最早的那一批;早于艾宾浩斯的重要人物包括威廉·冯特(通常被认为是实验心理学之父),以及上文提到的威廉·詹姆斯(通常被认为是美国心理学之父)。
Ebbinghaus. See Hermann Ebbinghaus, Memory: A Contribution to Experimental Psychology, trans. Henry Ruger and Clara Bussenius (1885; New York: Teachers College, Columbia University, 1913). William James’ glowing review of this work can be found in William James, Essays, Comments and Reviews (Cambridge, MA: Harvard University Press, 1987). Although Ebbinghaus was a pioneer of experimental psychology, he was not among the very first wave; significant figures predating Ebbinghaus include Wilhelm Wundt, often regarded as the father of experimental psychology, and William James, mentioned above, often regarded as the father of American psychology.
难忘
Unforgettable
图表。NV:“卢西塔尼亚号、珍珠港事件、水门事件”(平滑度 = 0)。
Chart. NV: “Lusitania, Pearl Harbor, Watergate” (smoothing = 0).
换个名字的记忆
A Memory by Any Other Name
1876年。给定数字或给定数值在文本中出现的概率并非均匀分布。相反,它遵循一种厚尾分布——在某些方面类似于幂律——称为本福特定律。例如,请参阅Theodore P. Hill,《有效数字定律的统计推导》,《统计科学》第10卷,第4期(1995年11月):第354-63页,在线访问:http://goo.gl/hLtUvm。
1876. The probability that a given digit or that a given number appears in a text is not uniform. Instead, it follows a heavy-tailed distribution—similar in some respects to a power law—called Benford’s law. See, for instance, Theodore P. Hill, “A Statistical Derivation of the Significant Digit Law,” Statistical Science 10, no. 4 (November 1995): 354–63, online at http://goo.gl/hLtUvm.
根据本福特定律,在文本中看到数字 1876 的可能性几乎为零。事实上,我们经常看到这个数字及其附近的数字,这在其他方面看来很不寻常,但考虑到它们主要与年份相对应,这完全合情合理。
According to Benford’s law, the likelihood of seeing the number 1876 in a text is virtually nil. In fact, we see this and nearby numbers in very significant quantities, an otherwise anomalous finding that makes perfect sense in light of the fact that they predominantly correspond to years.
本福特定律是一个非常有力的观察结果。例如,它可以用来检测纳税申报单中的欺诈行为:人们在伪造数字时往往不遵循本福特定律。这项应用是由谷歌现任首席经济学家哈尔·瓦里安等人提出的。参见哈尔·瓦里安,《致编辑的信》,《美国统计学家》第26卷,第3期(1972年6月)。更多关于心智与数字关系的论述,请参阅斯坦尼斯拉斯·德阿纳,《数感:心智如何创造数学》(牛津:牛津大学出版社,1997年)。
Benford’s law is a remarkably powerful observation. For instance, it can be used to detect fraud in tax returns: When fabricating numbers, people tend not to follow Benford’s law. This application was suggested by, among others, Hal Varian, currently chief economist at Google. See Hal Varian, “Letters to the Editor,” American Statistician 26, no. 3 (June 1972). For more on the relationship between the mind and numbers, see Stanislas Dehaene, The Number Sense: How the Mind Creates Mathematics (Oxford: Oxford University Press, 1997).
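For readers who want to see the numbers, the general form of Benford's law can be sketched in a few lines of Python. Under the law, the probability that a number's leading digits are d is log10(1 + 1/d). The function name `benford_leading_prob` is ours, and reading "the number 1876" as "a number whose leading digits are 1876" is an assumption made for illustration; it is not taken from Hill's paper.

```python
import math


def benford_leading_prob(prefix: int) -> float:
    """Probability, under Benford's law, that a number's leading digits equal `prefix`.

    Hypothetical helper for illustration: uses the general significant-digit
    form P(prefix) = log10(1 + 1/prefix).
    """
    if prefix < 1:
        raise ValueError("prefix must be a positive integer")
    return math.log10(1 + 1 / prefix)


# Expected share of each possible first digit, 1 through 9.
first_digit = {d: benford_leading_prob(d) for d in range(1, 10)}

# Expected share of numbers that begin with the digits 1876 (assumption: this
# stands in for "seeing the number 1876").
p_1876 = benford_leading_prob(1876)

for digit, prob in first_digit.items():
    print(f"leading {digit}: {prob:.3f}")
print(f"leading 1876: {p_1876:.6f}")
```

Under these assumptions a leading 1 is expected about 30 percent of the time, while a leading 1876 is expected only about twice in ten thousand numbers, which is why a large surplus of such numbers in real text stands out.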
辞旧迎新
Out with the Old, In with the New
信息时代之前的信息速度。威廉·多克瓦拉(William Dockwra)于1680年在伦敦创办了便士邮政(Penny Post),在广告中宣称“只需一便士”,即可“每天至少十五次”投递至“城内交易繁忙的地点”,从早上6点开始到晚上9点结束,大约每小时一次。他还承诺每天至少五次投递到伦敦及其周边“最偏远的地方”,并且便士邮政保证在四小时或更短的时间内送达。如果今天的邮局也能做到这一点就太好了。您可以访问英国邮政博物馆和档案馆的“伦敦便士邮政”页面(http://goo.gl/qwAtI)阅读这则广告。参见凯瑟琳·戈尔登(Catherine Golden)著《寄信:维多利亚时代书信写作的革命》(盖恩斯维尔:佛罗里达大学出版社,2009年);乔治·布鲁梅尔(George Brumell)著《1680–1840年的伦敦地方邮政》(切尔滕纳姆,英国:R. C. Alcock出版社,1950年);“省级便士邮政/第五条款”,英国邮政博物馆和档案馆,http://goo.gl/jomYJ;Randall Stross,“廉价通信(和垃圾邮件)的诞生”,《纽约时报》,2010年2月20日,在线网址为 http://goo.gl/SO0L0Y;Robert Darnton,“早期信息社会:十八世纪巴黎的新闻和媒体”,《美国历史评论》第105卷,第1期(2000年2月)。
The speed of information before the information age. William Dockwra established the Penny Post in London in 1680, advertising “For One Penny” delivery “at least fifteen times per day” to “places of quick negotiation within the City,” starting at 6:00 a.m. and ending at 9:00 p.m., or about once every hour. He also promised delivery at least five times a day “to the most remote places” in and around London, and the Penny Post guaranteed delivery in four hours or less. It would be great if the post office could do that today. Read the advertisement yourself at “London Penny Post,” The British Postal Museum & Archive, http://goo.gl/qwAtI. See Catherine Golden, Posting It: The Victorian Revolution in Letter Writing (Gainesville: University Press of Florida, 2009); George Brumell, The Local Posts of London 1680–1840 (Cheltenham, England: R. C. Alcock, 1950); “Provincial Penny Post/5th Clause,” The British Postal Museum & Archive, http://goo.gl/jomYJ; Randall Stross, “The Birth of Cheap Communication (and Junk Mail),” New York Times, February 20, 2010, online at http://goo.gl/SO0L0Y; Robert Darnton, “An Early Information Society: News and the Media in Eighteenth-Century Paris,” American Historical Review 105, no. 1 (February 2000).
巴克敏斯特·富勒绘制了一幅精美的图表,展现了信息在历史上传播的最大速度。请参阅巴克敏斯特·R·富勒和约翰·麦克海尔合著的《地球的萎缩》,在线版,网址:http://goo.gl/IfvqBL。
Buckminster Fuller created a beautiful graphical representation of the maximal speed at which information could travel throughout history. See Buckminster R. Fuller and John McHale, “Shrinking of Our Planet,” online at http://goo.gl/IfvqBL.
过去,快速传播的不仅仅是信息。在十九世纪,包裹可以通过——确切地说——地下管道网络——在城市中从一个地方寄到另一个地方。这些气动管道利用气压,将包裹以高达每小时25英里的速度送达纽约和巴黎等城市的各个角落。它们被组织成庞大而复杂的管道网络,遍布许多大城市的大部分地区。纽约在20世纪50年代停止了气动邮件的使用。巴黎的气动邮件系统一直运行到20世纪80年代,之后它被传真机广泛取代。如今,我们确实生活在一个信息时代,信息传递变得异常高效。但如果你想把一个真正的菠萝寄到曼哈顿的另一边,而不是仅仅寄送一个菠萝的动图或一封关于菠萝的信件,那么你很可能生活在一个世纪以前会更好。
It’s not just pure information that moved quickly in earlier times. In the nineteenth century, physical parcels could be mailed from place to place in a city through—quite literally—a network of underground tubes. These pneumatic tubes used air pressure to deliver parcels all over cities like New York and Paris at speeds of up to twenty-five miles per hour. They were organized into vast, complex tube networks that made their way through large parts of many major cities. New York discontinued use of pneumatic mail in the ’50s. Paris kept its system working through the ’80s, when it was largely replaced by the use of fax machines. Today, we do indeed live in an information age, in which we’ve become phenomenally good at moving around information. But if you wanted to send an actual pineapple across Manhattan, instead of just sending a gif of a pineapple or a letter about a pineapple, it’s quite possible that you would have been better off living a century ago.
据推测,这些管道至今仍然存在,我们猜想啮齿动物肯定时不时地会栖息在里面。因此,可以说纽约地下埋藏着一条信息高速公路,由在管道里跑来跑去的松鼠组成。但这并非互联网。(而且,这很可能是老鼠,而不是松鼠。)参见:JD Hayhurst,《巴黎的气动邮件》(牛津:英国法国和殖民地集邮协会,1974 年);LC Stanway,《伦敦地下邮件:伦敦地铁邮件运输的故事》(英国巴西尔登:埃塞克斯集邮协会,2000 年);“气动邮件”,国家邮政博物馆,http://goo.gl/uwsgmz。
Presumably, these tubes still exist, and we imagine that rodents must inhabit them from time to time. Thus it is fair to say that underneath New York lies an information superhighway consisting of squirrels running around in tubes. It’s just not the Internet. (And it’s probably rats, not squirrels.) See: J. D. Hayhurst, The Pneumatic Post of Paris (Oxford: France and Colonies Philatelic Society of Great Britain, 1974); L. C. Stanway, Mails Under London: The Story of the Carriage of the Mails on London’s Underground Railways (Basildon, England: Association of Essex Philatelic Societies, 2000); “Pneumatic Mail,” National Postal Museum, http://goo.gl/uwsgmz.
值得注意的是,PayPal、特斯拉汽车和SpaceX背后的企业家埃隆·马斯克最近提议恢复用于载人和货物的气动管道运输,他将这种公共交通方式命名为“超级高铁”(Hyperloop)。参见达蒙·拉夫林克,《埃隆·马斯克认为他能在45分钟内让你从纽约到达洛杉矶》,CNN科技频道,2013年7月17日,http://goo.gl/EXPdT。
Notably, Elon Musk, the entrepreneur behind PayPal, Tesla Motors, and SpaceX, recently proposed bringing back pneumatic tube transport for both humans and cargo, an approach to mass transit that he has dubbed the Hyperloop. See Damon Lavrinc, “Elon Musk Thinks He Can Get You from NY to LA in 45 Minutes,” CNN Tech, July 17, 2013, http://goo.gl/EXPdT.
尤里卡
Eureka
传真机。传真机怎么可能比电话发明得早呢?充分编码人类声音的丰富内容可能比编码几何图形更具挑战性。
The fax machine. How is it possible that the fax was invented before the telephone? Adequately encoding the richness of a human voice was likely a greater challenge than encoding geometric figures.
专利警告
Patent Caveat
谁发明了电话?关于谁应该获得“电话发明者”称号的争论仍在继续。2002年,美国众议院投票决定承认安东尼奥·梅乌奇是电话的发明者。与此同时,加拿大政府正式宣布证据不足以支持梅乌奇的说法。我们希望联合国安理会能尽快介入。参见罗伯特·V·布鲁斯,《贝尔:亚历山大·格雷厄姆·贝尔与征服孤独》(波士顿:利特尔布朗出版社,1973年)。更多关于梅乌奇的信息,请参阅《科学美国人》增刊,第520期(1885年12月19日)。
Who invented the phone? The debate over who deserves the title of “inventor of the telephone” is still raging. In 2002, the United States House of Representatives voted to recognize Antonio Meucci as an inventor of the telephone. Meanwhile, the Canadian government officially declared that the evidence was not substantial enough to support Meucci’s claim. We hope the United Nations Security Council will weigh in soon. See Robert V. Bruce, Bell: Alexander Graham Bell and the Conquest of Solitude (Boston: Little, Brown, 1973). For more on Meucci, see Scientific American Supplement, no. 520 (December 19, 1885).
“然后我对着 M 大喊。”请参阅美国国会图书馆收藏的亚历山大·格雷厄姆·贝尔家族文件,1862-1939 年,http://memory.loc.gov/ammem/bellhtml/。
“I then shouted into M.” See the Alexander Graham Bell Family Papers at the Library of Congress, 1862–1939, http://memory.loc.gov/ammem/bellhtml/.
147次相亲
147 Blind Dates
发明清单。本研究中使用的发明完整清单可在 Michel2011S 中找到。某项发明的发明和专利授权之间必然存在时间间隔,通常长达数年。在某些情况下,发明日期可以明确确定,但专利授权却经历了一段特别长的延迟。例如,特雷门琴(Theremin),一种由俄罗斯人列昂·特雷门(Leon Theremin)于 1920 年发明的乐器;该乐器的美国专利于 1928 年获得授权。在这种情况下,我们使用发明日期,而不是专利授权日期。
List of inventions. The full list of inventions we used for this study is available in Michel2011S. There is invariably a lag, usually several years long, between when something was invented and when the patent was issued. In some cases, the date of invention could be unambiguously determined, and there was a particularly long delay before the patent was granted. An example is the theremin, a musical instrument invented in 1920 by Leon Theremin in Russia; the U.S. patent for the device was issued in 1928. In such cases, we use the date of invention, not the date on which the patent was issued.
发明的生命周期。埃弗里特·M·罗杰斯所著的《创新的扩散》(纽约:自由出版社,1962年)是一部探讨创新在社会中传播方式的经典著作。
The life cycle of inventions. Everett M. Rogers, Diffusion of Innovations (New York: Free Press, 1962) is a classic text on the way innovation spreads through a society.
奇点或破产!
Singularity or Bust!
乌拉姆与冯·诺依曼。这段引文出自斯坦尼斯拉夫·乌拉姆撰写的一篇感人至深的冯·诺依曼讣告,乌拉姆在讣告中回忆了与冯·诺依曼就此话题进行的一次讨论。这篇讣告广泛回顾了冯·诺依曼对现代科学的众多远见卓识的贡献。参见斯坦尼斯拉夫·乌拉姆,《约翰·冯·诺依曼 1903–1957》,《美国数学学会公报》第64卷(1958年):1-49页。
Ulam and von Neumann. The quote is from a moving obituary of von Neumann written by Stanislaw Ulam, in which Ulam recalls a discussion with von Neumann on this topic. The obituary provides a broad review of von Neumann’s numerous visionary contributions to modern science. See Stanislaw Ulam, “John von Neumann 1903–1957,” Bulletin of the American Mathematical Society 64 (1958): 1–49.
雷·库兹韦尔。参见其著作《奇点临近:当人类超越生物学》(纽约:维京出版社,2005年)。自2012年以来,库兹韦尔一直担任谷歌工程总监,致力于让计算机理解自然语言。
Ray Kurzweil. See his book The Singularity Is Near: When Humans Transcend Biology (New York: Viking, 2005). Since 2012, Kurzweil has been the director of engineering at Google, with a mandate to make computers understand natural language.
民族精神、文化、文化组学
Volksgeist, Culture, Culturomics
约翰·戈特弗里德·赫尔德。除了“民族精神”(Volksgeist)一词外,赫尔德还创造了被广泛使用的“时代精神”(Zeitgeist),即“时代精神”。参见约翰·戈特弗里德·赫尔德, 《人类历史哲学反思》(芝加哥:芝加哥大学出版社,1968年);弗雷德里克·M·巴纳德,《赫尔德的社会与政治思想》(牛津:克拉伦登出版社,1965年)。
Johann Gottfried Herder. In addition to the term Volksgeist, Herder also coined the widely used term Zeitgeist, or “spirit of the time.” See Johann Gottfried Herder, Reflections on the Philosophy of the History of Mankind (Chicago: University of Chicago Press, 1968); Frederick M. Barnard, Herder’s Social and Political Thought (Oxford: Clarendon Press, 1965).
赫尔德、民族主义与种族主义。参见罗伯特·莱因霍尔德·埃尔冈,《赫尔德与德国民族主义的基础》(纽约:哥伦比亚大学出版社,1931年);乔治·M·弗雷德里克森,《种族主义:简史》(新泽西州普林斯顿:普林斯顿大学出版社,2003年);伊芙·加勒德和杰弗里·斯卡里编,《道德哲学与大屠杀》(佛蒙特州伯灵顿:阿什盖特出版社,2003年)。
Herder, nationalism, and racism. See Robert Reinhold Ergang, Herder and the Foundations of German Nationalism (New York: Columbia University Press, 1931); George M. Fredrickson, Racism: A Short History (Princeton, NJ: Princeton University Press, 2003); Eve Garrard and Geoffrey Scarre, eds., Moral Philosophy and the Holocaust (Burlington, VT: Ashgate, 2003).
弗朗茨·博厄斯。当然,博厄斯对文化的看法对仇恨贩子来说并非好事。纳粹焚烧了他的著作,撤销了他的博士学位,并谴责博厄斯学派人类学是“犹太科学”。更多关于博厄斯对文化概念的贡献,请参阅乔治·W·斯托金(George W. Stocking, Jr.)的《历史视角下的弗朗茨·博厄斯与文化概念》,载《美国人类学家》第68卷(1966年):867-82页,在线访问:http://goo.gl/VIyZ8g。另请参阅斯托金主编的《作为方法与伦理的民族精神:论博厄斯民族志与德国人类学传统》(麦迪逊:威斯康星大学出版社,1998年)。尤其值得一提的是,马蒂·邦泽尔(Matti Bunzl)为该书撰写的《弗朗茨·博厄斯与洪堡传统:从民族精神和民族性格到人类学的文化概念》。
Franz Boas. Of course, Boas’ take on culture was bad business for hate-mongers. The Nazis burned his books, rescinded his PhD, and denounced Boasian anthropology as “Jewish science.” For more on Boas’ contributions to the concept of culture, see George W. Stocking, Jr., “Franz Boas and the Culture Concept in Historical Perspective,” American Anthropologist 68 (1966): 867–82, online at http://goo.gl/VIyZ8g. Also see Stocking’s edited volume Volksgeist as Method and Ethic: Essays on Boasian Ethnography and the German Anthropological Tradition (Madison: University of Wisconsin Press, 1998). In particular, see Matti Bunzl’s contribution to that book, “Franz Boas and the Humboldtian Tradition: From Volksgeist and Nationalcharakter to an Anthropological Notion of Culture.”
-omics。当我们创造“文化组学”一词时,我们一直希望它的发音是长o ,就像基因组学的标准发音一样(或者像单词“owe”)。然而,最近麦克米伦词典的发音指南指出,该词应该发短o的音,就像在economics中那样。(参见上文“四个生日和一个葬礼”的注释。)难道词典在类似的事情上也犯错吗?是我们读错了吗?是我们从一开始就读错了,还是在麦克米伦发布声明后才开始读错?有关-omics的更多信息,请参阅詹姆斯·戈尔曼的《'Ome,' 科学宇宙扩展之声》,《纽约时报》 ,2012 年 5 月 3 日,http://goo.gl/I0um5。
-omics. When we coined the term “culturomics,” we always intended for it to be pronounced with a long o, as in the standard pronunciation of genomics (or as in the word “owe”). Recently, however, the pronunciation guide of the Macmillan dictionary reported that the word ought to be pronounced with a short o, as in economics. (See notes to “Four Birthdays and a Funeral,” above.) Can the dictionary be wrong about something like this? Did we get it wrong? Were we mispronouncing it from the beginning, or did we start being wrong only after Macmillan made its announcement? For more about -omics, see James Gorman, “‘Ome,’ the Sound of the Scientific Universe Expanding,” New York Times, May 3, 2012, http://goo.gl/I0um5.
应对成瘾:一种新策略
Coping with Addiction: A New Strategy
Ngram Viewer。我们想为开发出如此高效的时间浪费工具向所有人致歉。我们从未想过要浪费大家这么多时间。要是有什么办法可以弥补生产力损失造成的损失就好了。更多关于 Ngram Viewer 的使用方法,请参阅 Patricia Cohen 的《5000 亿词汇,开启文化新窗口》(《纽约时报》),2010 年 12 月 16 日,在线访问 http://goo.gl/16gtxR;Alexis C. Madrigal 的《吸血鬼与僵尸:不同时期词汇使用情况比较》(《大西洋月刊》) ,2010 年 12 月 17 日,在线访问 http://goo.gl/MUUnG1。
Ngram Viewer. We would like to apologize to everyone for creating such an effective time-waster. It was never our intention to waste so much of people’s time. If only there were some way we could undo the damage caused by all that lost productivity. For more on how the Ngram Viewer has been used, see Patricia Cohen, “In 500 Billion Words, a New Window on Culture,” New York Times, December 16, 2010, online at http://goo.gl/16gtxR; Alexis C. Madrigal, “Vampire vs. Zombie: Comparing Word Usage Through Time,” Atlantic, December 17, 2010, online at http://goo.gl/MUUnG1.
妈妈,火星人从哪里来?
Mommy, where do Martians come from?
伽利略。伽利略在《关于两大世界体系的对话》(第321页)中讨论了这一点。有关伽利略火星观测结果的现代重建尝试,请参阅威廉·T·彼得斯,《1610年金星和火星的出现》,《天文学史杂志》第15卷,第3期(1984年)。
Galileo. Galileo discusses this point in Dialogue Concerning the Two Chief World Systems, 321. For a modern attempt to reconstruct some of Galileo’s Martian observations, see William T. Peters, “The Appearances of Venus and Mars in 1610,” Journal for the History of Astronomy 15, no. 3 (1984).
夏帕雷利。参见 Giovanni Virginio Schiaparelli,《La Vita sul Pianeta Marte》(米兰:模仿文化协会,1998 年)。
Schiaparelli. See Giovanni Virginio Schiaparelli, La Vita sul Pianeta Marte (Milan: Associazione Culturale Mimesis, 1998).
火星运河。洛厄尔最初就该主题撰写了三本著作,分别是《火星》(波士顿:霍顿·米夫林,1895年)、《火星及其运河》(纽约:麦克米伦,1911年)和《火星作为生命之所》(纽约:麦克米伦,1908年)。阿尔弗雷德·拉塞尔·华莱士在其著作《火星适宜居住吗?》(纽约:麦克米伦,1907年)中驳斥了洛厄尔的观点。另请参阅史蒂文·J·迪克的《其他世界的生命》(剑桥:剑桥大学出版社,1998年);罗伯特·马克利的《垂死的星球》(北卡罗来纳州达勒姆:杜克大学出版社,2005年)。更多关于洛厄尔的信息,请参阅大卫·施特劳斯的《珀西瓦尔·洛厄尔》(马萨诸塞州剑桥:哈佛大学出版社,2001年)。
Canals on Mars. Lowell’s original three books on the topic are Mars (Boston: Houghton Mifflin, 1895); Mars and Its Canals (New York: Macmillan, 1911); and Mars as the Abode of Life (New York: Macmillan, 1908). Alfred Russel Wallace, in Is Mars Habitable? (New York: Macmillan, 1907), refuted Lowell’s position. See also Steven J. Dick, Life on Other Worlds (Cambridge: Cambridge University Press, 1998); Robert Markley, Dying Planet (Durham, NC: Duke University Press, 2005). For more about Lowell, see David Strauss, Percival Lowell (Cambridge, MA: Harvard University Press, 2001).
美国天文学家的院长。参见David H. Devorkin著《亨利·诺里斯·罗素:美国天文学家的院长》(新泽西州普林斯顿:普林斯顿大学出版社,2000年)。
Dean of American astronomers. See David H. Devorkin, Henry Norris Russell: Dean of American Astronomers (Princeton, NJ: Princeton University Press, 2000).
“也许是最好的。”参见迪克,《其他世界的生活》,第 35 页。
“Perhaps the best.” See Dick, Life on Other Worlds, 35.
《世界大战》。参见赫伯特·乔治·威尔斯,《世界大战》(伦敦:威廉·海涅曼,1898年)。
The War of the Worlds. See H. G. Wells, The War of the Worlds (London: William Heinemann, 1898).
火星地球仪曾用于规划水手号任务。该地球仪基于一张名为MEC-1原型的地图,该地图由曾在洛厄尔手下训练的EC·斯利弗(EC Slipher)绘制。尽管科学界普遍反对运河,但斯利弗似乎一直对运河持乐观态度,直到1964年去世。水手4号飞掠火星是在1965年。您可以在http://goo.gl/GrOKZ上查看MEC-1原型地图,甚至可以使用谷歌地球(Google Earth)探索火星运河地图。有关如何操作的视频,请参阅“火星”,谷歌地球,http://goo.gl/ZXZZa。斯利弗的论文集收录于“EC·斯利弗文集”,亚利桑那在线档案馆,http://goo.gl/jXva1D。
The Martian globe used to plan the Mariner missions. The globe was based on a map known as the MEC-1 prototype, created by E. C. Slipher, who had trained under Lowell. Despite the scientific consensus having turned against canals, Slipher appears to have remained bullish about them until his death in 1964. The Mariner 4 flyby took place in 1965. You can see the MEC-1 prototype map at http://goo.gl/GrOKZ, and you can even explore the maps of Martian canals using Google Earth. For a video that describes how, see “Mars,” Google Earth, http://goo.gl/ZXZZa. Slipher’s collected papers are at “E. C. Slipher Collection,” Arizona Archives Online, http://goo.gl/jXva1D.
水手号。有关水手号任务的更多信息,请参阅约翰·汉密尔顿 (John Hamilton) 的《水手号火星任务》 (明尼阿波利斯:ABDO,1998 年)。
Mariner. For more on the Mariner missions, see John Hamilton, The Mariner Missions to Mars (Minneapolis: ABDO, 1998).
第 7 章 乌托邦、反乌托邦和 DAT(A) 乌托邦
CHAPTER 7. UTOPIA, DYSTOPIA, AND DAT(A)TOPIA
简介
Intro
大卫王。参阅撒母耳记下第24章。
King David. See II Samuel 24.
埃德加·爱伦·坡。参见杰弗里·迈耶斯,《埃德加·爱伦·坡:他的一生与遗产》(纽约:查尔斯·斯克里布纳之子出版社,1992年)。坡气球恶作剧的低分辨率复制品出现在“巴黎气动网络”( Cix出版社,2000年),http://goo.gl/nCo3s。
Edgar Allan Poe. See Jeffrey Meyers, Edgar Allan Poe: His Life and Legacy (New York: Charles Scribner’s Sons, 1992). A low-resolution facsimile of Poe’s balloon hoax appears at “Réseau Pneumatic de Paris,” Cix, 2000, http://goo.gl/nCo3s.
更新 Ngram 查看器。最新版本的 Ngram 数据取自八百万本图书,并添加了词性标注功能。请参阅 Yuri Lin 等人的论文“Google 图书 Ngram 语料库的句法注释”, ACL 2012 系统演示论文集(2012 年):169-174;Yuri Lin 的论文“Google 图书的句法注释 Ngram”(硕士论文,麻省理工学院,2012 年)。
Update to the Ngram Viewer. The most recent version of the ngram data draws from eight million books and introduces part-of-speech tagging. See Yuri Lin et al., “Syntactic Annotations for the Google Books Ngram Corpus,” Proceedings of the ACL 2012 System Demonstrations (2012): 169–74; Yuri Lin, “Syntactically Annotated Ngrams for Google Books” (master’s thesis, Massachusetts Institute of Technology, 2012).
谷歌数字化的图书数量。参见罗伯特·达恩顿,《国家数字公共图书馆正式启动!》 , 《纽约书评》,2013年4月25日,在线网址:http://goo.gl/OI5n2J。
The number of books digitized by Google. See Robert Darnton, “The National Digital Public Library Is Launched!,” New York Review of Books, April 25, 2013, online at http://goo.gl/OI5n2J.
电子书。到2009年,亚马逊的电子书销量已超过精装书。参见Charlie Sorrel的《亚马逊:今年圣诞节Kindle电子书销量超过实体书》,《连线》杂志,2009年12月28日,在线访问http://goo.gl/ZsB7it。2012年,电子书占据美国图书市场的23%。参见Jeremy Greenfield的《2012年电子书占出版商收入的23%,尽管增长趋于平缓》,《数字图书世界》,2013年4月11日,在线访问http://goo.gl/u0d1GJ。
E-books. By 2009, Amazon was already selling more e-books than hardcover books. See Charlie Sorrel, “Amazon: Kindle Books Outsold Real Books This Christmas,” Wired, December 28, 2009, online at http://goo.gl/ZsB7it. In 2012, e-books accounted for 23 percent of the book market in the United States. See Jeremy Greenfield, “Ebooks Account for 23% of Publisher Revenue in 2012, Even as Growth Levels,” Digital Book World, April 11, 2013, online at http://goo.gl/u0d1GJ.
扩大数字图书的可及性。HathiTrust(http://www.hathitrust.org)、互联网档案馆(http://archive.org/index.php)、古腾堡计划(http://www.gutenberg.org)以及美国数字公共图书馆(http://dp.la)是一些旨在向公众提供数字图书的著名项目。当书籍全文可供获取时,人们可以构建更强大的工具来追踪文化趋势。
Increasing access to digital books. The HathiTrust (http://www.hathitrust.org), the Internet Archive (http://archive.org/index.php), Project Gutenberg (http://www.gutenberg.org), and the Digital Public Library of America (http://dp.la) are several of the most notable efforts aimed at making digital books available to the public. When the full texts of books are available, one can build far more powerful tools for tracking cultural trends.
可以在 bookworm.culturomics.org 上找到一个示例。谷歌对原版 Bookworm 的闭源改编版本使用的名称是“Ngram Viewer”。“Bookworm”是文化观察站的一项开源工作。Bookworm 代码库由 Benjamin Schmidt、Neva Cherniavsky Durand、Martin Camacho、Matthew Nicklay 和 Linfeng Yang 共同开发。Schmidt 是首席开发人员。
An example can be found at bookworm.culturomics.org. Google’s closed-source adaptation of the original Bookworm uses the name “Ngram Viewer.” “Bookworm” is an open-source effort at the Cultural Observatory. The Bookworm code base was developed together with Benjamin Schmidt, Neva Cherniavsky Durand, Martin Camacho, Matthew Nicklay, and Linfeng Yang. Schmidt was the lead developer.
图书馆面临的威胁。参见S·彼得·戴维斯,《我们正处于历史上又一个‘焚书’时期的六个原因》,《 Cracked 》 ,2011年10月11日,http://goo.gl/FBZoD;马修·谢尔,《死书俱乐部》,《纽约》 ,2012年8月12日,http://goo.gl/UAIDN;玛丽·琼斯,《康威图书馆服务公司将大卫·劳合·乔治的书籍制成纸浆》,《每日邮报》,2011年3月24日,http://goo.gl/b1pK0;海伦·卡特,《作家和诗人呼吁曼彻斯特中央图书馆停止书籍制浆》,《卫报》,2012年6月22日,http://goo.gl/lEas1P。
The threat to libraries. See S. Peter Davis, “6 Reasons We’re in Another ‘Book-Burning’ Period in History,” Cracked, October 11, 2011, http://goo.gl/FBZoD; Matthew Shaer, “Dead Books Club,” New York, August 12, 2012, http://goo.gl/UAIDN; Mari Jones, “David Lloyd George’s Books Pulped by Conwy Libraries Services,” Daily Post, March 24, 2011, http://goo.gl/b1pK0; Helen Carter, “Authors and Poets Call Halt to Book Pulping at Manchester Central Library,” Guardian, June 22, 2012, http://goo.gl/lEas1P.
报纸数字化。参见“编年史美国”(Chronicling America),美国国家人文基金会,http://chroniclingamerica.loc.gov;Trove,澳大利亚国家图书馆,http://trove.nla.gov.au;以及现已停止运营的“谷歌新闻档案”(Google News),http://news.google.com/newspapers。
Newspaper digitization. See Chronicling America, National Endowment for the Humanities, http://chroniclingamerica.loc.gov; Trove, National Library of Australia, http://trove.nla.gov.au; and the now-defunct effort Google News Archive, Google News, http://news.google.com/newspapers.
古代及未出版的文献。例如,参见耶路撒冷以色列博物馆的“数字化死海古卷”,http://dss.collections.imj.org.il;塔夫茨大学珀尔修斯数字图书馆,http://www.perseus.tufts.edu。德克萨斯大学奥斯汀分校哈里·兰塞姆中心的“埃德加·爱伦·坡数字收藏”中,可以找到与坡相关的文物的数字化成果,http://goo.gl/XvcqO。
Ancient and unpublished texts. See, for instance, “Digitized Dead Sea Scrolls,” Israel Museum, Jerusalem, http://dss.collections.imj.org.il; Perseus Digital Library, Tufts University, http://www.perseus.tufts.edu. An effort to digitize artifacts related to Poe can be found at “The Edgar Allan Poe Digital Collection,” Harry Ransom Center, University of Texas at Austin, http://goo.gl/XvcqO.
数字化物理世界。请访问欧洲数字图书馆 (Europeana),http://europeana.eu,了解欧洲为开放文本、艺术品、电影以及许多其他文化资源所做的巨大努力。
Digitizing the physical world. See Europeana, http://europeana.eu, for a vast effort at opening access to texts, artwork, films, and many other cultural objects in Europe.
数字化的当下
Digital Present
我们的数据足迹。参见Josh James,《每分钟产生了多少数据?》, DOMO,2012年6月8日,http://goo.gl/RN5eB。珀尔修斯图书馆项目(Perseus Library Project)主编格雷戈里·克兰(Gregory Crane)教授指出,大约有一亿个希腊语词汇在公元600年之前留存下来。该项目旨在将所有古希腊文献数字化;格雷戈里·克兰,2013年5月18日致让-巴蒂斯特·米歇尔(Jean-Baptiste Michel)的电子邮件。
Our data footprint. See Josh James, “How Much Data Is Created Every Minute?,” DOMO, June 8, 2012, http://goo.gl/RN5eB. Professor Gregory Crane, editor in chief of the Perseus Library Project, which aims to digitize all texts from ancient Greece, suggested that roughly one hundred million words of Greek survive from before 600 CE; Gregory Crane, e-mail to Jean-Baptiste Michel, May 18, 2013.
垃圾邮件。 2010年,全球发送的107万亿封电子邮件中,89.1%是垃圾邮件。请参阅Royal Pingdom网站2011年1月12日发表的“2010年互联网数据”,网址:http://goo.gl/ziXncU。
Spam. Of the 107 trillion e-mails sent in 2010, 89.1 percent were spam. See “Internet 2010 in Numbers,” Royal Pingdom, January 12, 2011, online at http://goo.gl/ziXncU.
数字化未来
Digital Future
TotalRecall。Deb Roy 的 TED 演讲风趣幽默,内容翔实。参见 Deb Roy 的《一个词的诞生》(视频,时长 19:52,2011 年 3 月,http://goo.gl/5MoJo)。更多项目详情,请参阅 Jonathan Keats 的《咿呀学语的力量》(《连线》杂志,2007 年 3 月,http://goo.gl/3epTR)和 Jason B. Jones 的《让家庭录像更有价值》(《连线》杂志,2011 年 3 月 25 日,http://goo.gl/V3oTL)。更多技术概述,请参阅 Deb Roy 等人的《人类语音组计划》(麻省理工学院,2006 年 7 月,http://goo.gl/O3E0e)和 Rony Kubat 等人的《TotalRecall:大型视听语料库的可视化和半自动标注》(麻省理工学院,http://goo.gl/Dra7T)。
TotalRecall. Deb Roy’s TED talk is entertaining and informative. See Deb Roy, The Birth of a Word, video, 19:52, March 2011, http://goo.gl/5MoJo. More details about the project are available at Jonathan Keats, “The Power of Babble,” Wired, March 2007, http://goo.gl/3epTR; Jason B. Jones, “Making That Home Video Count,” Wired, March 25, 2011, http://goo.gl/V3oTL. More technical overviews include Deb Roy et al., “The Human Speechome Project,” Massachusetts Institute of Technology, July 2006, http://goo.gl/O3E0e; Rony Kubat et al., “TotalRecall: Visualization and Semi-Automatic Annotation of Very Large Audio-Visual Corpora,” Massachusetts Institute of Technology, http://goo.gl/Dra7T.
生活记录。生活记录、可穿戴计算以及日益流行的量化自我概念都是密切相关的概念。参见 Steve Henn 的《聪明的黑客赋予谷歌许多意想不到的力量》,NPR,2013 年 7 月 17 日,http://goo.gl/eyUW9;Edna Pasher 和 Michael Lawo 的《智能服装》(宾夕法尼亚州兰斯代尔:IOS 出版社,2009 年);Tomio Geron 的《扫描你的太阳穴,用新的未来设备管理你的健康》,《福布斯》,2012 年 11 月 29 日,http://goo.gl/9lg72;Greg Beato 的《量化的自我》,《Reason》,2011 年 12 月 21 日;Mark Krynsky 的《2013 年 CES 上最好的健康和健身小工具公告》,Lifestream Blog,2013 年 1 月 18 日,http://goo.gl/Qq0BY; Eric Topol,《医学的创造性破坏》(纽约:Basic Books,2011 年);Jody Ranck,《互联健康》(旧金山:GigaOM,2012 年)。
Life logging. Life logging, wearable computing, and the increasingly fashionable notion of a quantified self are all intimately related concepts. See Steve Henn, “Clever Hacks Give Google Many Unintended Powers,” NPR, July 17, 2013, http://goo.gl/eyUW9; Edna Pasher and Michael Lawo, Intelligent Clothing (Lansdale, PA: IOS Press, 2009); Tomio Geron, “Scan Your Temple, Manage Your Health with New Futuristic Device,” Forbes, November 29, 2012, http://goo.gl/9lg72; Greg Beato, “The Quantified Self,” Reason, December 21, 2011; Mark Krynsky, “The Best Health and Fitness Gadget Announcements from CES 2013,” Lifestream Blog, January 18, 2013, http://goo.gl/Qq0BY; Eric Topol, The Creative Destruction of Medicine (New York: Basic Books, 2011); Jody Ranck, Connected Health (San Francisco: GigaOM, 2012).
脑机接口。参见Leigh R. Hochberg等人的论文《四肢瘫痪者对假肢装置的神经元集群控制》,《自然》第442卷,第7099期(2006年):164-171页;Martin M. Monti等人的论文《意识障碍中脑活动的有意调节》,《新英格兰医学杂志》第362卷,第7期(2010年):579-589页。这两项研究都具有里程碑意义。
Mind-machine interfaces. See Leigh R. Hochberg et al., “Neuronal Ensemble Control of Prosthetic Devices by a Human with Tetraplegia,” Nature 442, no. 7099 (2006): 164–71; Martin M. Monti et al., “Willful Modulation of Brain Activity in Disorders of Consciousness,” New England Journal of Medicine 362, no. 7 (2010): 579–89. Both are landmark studies.
意识流。参见史蒂芬·平克,《思想的本质》(纽约:维京企鹅出版社,2007年)和克里斯·斯沃耶,《相对主义》,《斯坦福哲学百科全书》 (2010年冬季)。意识流的概念通常被认为是威廉·詹姆斯提出的。
Stream of consciousness. See Steven Pinker, The Stuff of Thought (New York: Viking Penguin, 2007), and Chris Swoyer, “Relativism,” The Stanford Encyclopedia of Philosophy (Winter 2010). The notion of a stream of consciousness is generally credited to William James.
真相与后果
Truth and Consequences
波士顿马拉松爆炸案调查。调查人员仔细查阅了现场人员拍摄的大量照片和录像,并向公众征求意见,以确认两名嫌疑人的身份。参见斯宾塞·阿克曼,《波士顿马拉松调查数据将众包》,《连线》 ,2013年4月16日,网址:http://goo.gl/DpPKca;皮特·威廉姆斯等,《调查员恳求马拉松爆炸案调查协助:‘有人知道是谁干的’》,《NBC新闻》,2013年4月16日,网址:http://goo.gl/46kndz。
The Boston Marathon bombing investigation. Investigators combed through vast quantities of pictures and movies recorded by individuals present on the scene and asked the public for help identifying two suspects. See Spencer Ackerman, “Data for the Boston Marathon Investigation Will Be Crowdsourced,” Wired, April 16, 2013, online at http://goo.gl/DpPKca; Pete Williams et al., “Investigator Pleads for Help in Marathon Bombing Probe: ‘Someone Knows Who Did This,’” NBC News, April 16, 2013, online at http://goo.gl/46kndz.
雷泰·帕森斯。这位17岁的女孩于2013年4月4日上吊自杀,并因此陷入昏迷;三天后,生命维持系统被撤除。参见《加拿大女孩雷泰·帕森斯自杀未遂后死亡;父母称她曾遭四名男孩强奸》,《赫芬顿邮报》,2013年4月9日,在线网址:http://goo.gl/Cqs030。
Rehtaeh Parsons. The seventeen-year-old hanged herself on April 4, 2013. As a result, she fell into a coma; three days later, she was taken off life support. See “Rehtaeh Parsons, Canadian Girl, Dies After Suicide Attempt; Parents Allege She Was Raped by 4 Boys,” Huffington Post, April 9, 2013, online at http://goo.gl/Cqs030.
数据就是力量
Data Is Power
营销人员对你了解多少?参见查尔斯·杜希格,《公司如何了解你的秘密》,《纽约时报》 ,2012年2月16日,在线访问:http://goo.gl/DV04Me。
What marketers know about you. See Charles Duhigg, “How Companies Learn Your Secrets,” New York Times, February 16, 2012, online at http://goo.gl/DV04Me.
政府对你了解多少?参见约瑟夫·阿克斯,《占领华尔街抗议者无法阻止检察官获取其推文》,《芝加哥论坛报》,2012年9月17日。
What the government knows about you. See Joseph Ax, “Occupy Wall Street Protester Can’t Keep Tweets from Prosecutors,” Chicago Tribune, September 17, 2012.
Offlogging。参见 Jamie Skorheim 的文章“西雅图酒吧率先禁止使用谷歌眼镜”,MyNorthwest.com,2013年3月8日。
Offlogging. See Jamie Skorheim, “Seattle Bar Steps Up as First to Ban Google Glasses,” MyNorthwest.com, March 8, 2013.
Snapchat。需要注意的是,Snapchat“已删除”的消息至少在某些情况下是可以恢复的;这一发现已促使其向联邦贸易委员会正式投诉。参见Jessica Guynn的《隐私监管机构EPIC向联邦贸易委员会投诉Snapchat》,《洛杉矶时报》,2013年5月17日,http://goo.gl/WSxTxA。
Snapchat. Note that Snapchat’s “deleted” messages can be recovered, at least in some cases; this discovery has led to a formal complaint to the Federal Trade Commission. See Jessica Guynn, “Privacy Watchdog EPIC Files Complaint Against Snapchat with FTC,” Los Angeles Times, May 17, 2013, http://goo.gl/WSxTxA.
志趣相投的人
Kindred Spirits
先驱者。参见弗朗哥·莫雷蒂著《图表、地图、树状图:文学史的抽象模型》(伦敦:Verso出版社,2005年),以及乔治·米勒上文引用的引文(摘自《拆开玫瑰数花瓣》一书注释);马修·L·乔克斯著《宏观分析:数字化方法与文学史》(厄巴纳:伊利诺伊大学出版社,2013年);詹姆斯·M·休斯等著《文学演变中风格影响的量化模式》,载《美国国家科学院院刊》第109卷,第20期(2012年),第7682–86页,在线访问:http://goo.gl/3uaAoM;詹姆斯·W·彭尼贝克著《代词的秘密生活:我们的言语告诉我们什么》(纽约:布卢姆斯伯里出版社,2011年)。 “共享视野”会议的网址是http://goo.gl/fnyWw。如果您想深入了解科学与人文学科的未来,我们推荐爱德华·O·威尔逊的《一致性:知识的统一》(纽约:阿尔弗雷德·A·克诺夫出版社,1998年)。关于科学与人文学科之间紧张关系的重要参考书目是CP·斯诺的《两种文化与科学革命》(伦敦:剑桥大学出版社,1959年)。
Pioneers. See Franco Moretti, Graphs, Maps, Trees: Abstract Models for a Literary History (London: Verso, 2005), and, in this vein, the quote cited above (in the notes to “Taking Roses Apart to Count Their Petals”) by George Miller; Matthew L. Jockers, Macroanalysis: Digital Methods and Literary History (Urbana: University of Illinois Press, 2013); James M. Hughes et al., “Quantitative Patterns of Stylistic Influence in the Evolution of Literature,” Proceedings of the National Academy of Sciences 109, no. 20 (2012): 7682–86, online at http://goo.gl/3uaAoM; James W. Pennebaker, The Secret Life of Pronouns: What Our Words Say About Us (New York: Bloomsbury, 2011). The Shared Horizons conference Web site is at http://goo.gl/fnyWw. For an insightful read about the future of science and the humanities, we recommend Edward O. Wilson, Consilience: The Unity of Knowledge (New York: Alfred A. Knopf, 1998). The essential reference on the tension between the sciences and the humanities is C. P. Snow, The Two Cultures and the Scientific Revolution (London: Cambridge University Press, 1959).
心理史学
Psychohistory
量化社会。参见 Adolphe Quetelet, Sur l'Homme et le Développement de Ses Facultés,ou,Essai de Physique Sociale(布鲁塞尔:L. Hauman,1836);埃米尔·涂尔干 (Émile Durkheim),《社会学方法的规则》(Les Règles de la Méthode Sociologique)(巴黎:F. Alcan,1895);奥古斯特·孔德和哈里特·马蒂诺,《实证哲学》(纽约:AMS Press,1974)。值得将这一思路与 Zipf, 1935 的动机进行比较:
The quantified society. See Adolphe Quetelet, Sur l’Homme et le Développement de Ses Facultés, ou, Essai de Physique Sociale (Brussels: L. Hauman, 1836); Émile Durkheim, Les Règles de la Méthode Sociologique (Paris: F. Alcan, 1895); Auguste Comte and Harriet Martineau, The Positive Philosophy (New York: AMS Press, 1974). It’s worth comparing this line of thought with that which motivated Zipf, 1935:
大约十年前,当我在柏林大学学习语言学时,我突然想到,以精确科学的方式,通过将统计原理直接应用于客观的语音现象,来研究语音作为一种自然现象可能会很有成效。
Nearly ten years ago, while studying linguistics at the University of Berlin, it occurred to me that it might be fruitful to investigate speech as a natural phenomenon . . . in the manner of the exact sciences, by the direct application of statistical principles to the objective speech-phenomena.
图表。在哈佛大学学生马丁·卡马乔 (Martin Camacho) 和纪尧姆·巴斯 (Guillaume Basse) 的帮助下,我们分析了文化惯性。我们探究的是,那些线性增长、在二十年内翻倍的 ngram 是否会在最初的二十年之后继续上升。数百个这样的 ngram 被取平均值,从而绘制出图中所示的深灰色线;图中每个点都是该时间点平均值中包含的所有 ngram 的中位数。请注意,每个 ngram 的时间轴都有偏移,因此最初二十年的增长始终从第 0 年开始。由于 ngram 的选取方式,这二十年必然会出现急剧的增长,因此图中突出显示了这段时期。随后,ngram 继续上升,表明存在惯性。浅灰色平均值的 ngram 是根据二十年的线性下降趋势选取的。这些也显示了惯性,只是这次是向下的。这种效应非常明显。虽然从该图表中无法推断出什么,但在突出显示的下降趋势出现 30 年后,超过 90% 的 ngram 进一步下降。
Chart. We analyzed cultural inertia with the help of Martin Camacho and Guillaume Basse, both students at Harvard. We asked whether ngrams that increase linearly, leading to doubling over two decades, continue to rise after the initial two-decade period. Hundreds of such ngrams are averaged to create the dark gray line shown in the plot; each point in the plot is the median of all the ngrams included in the average at that time point. Note that the time axis for each ngram is offset so that the initial twenty-year rise always begins at Year 0. This initial twenty-year period, during which a sharp increase is guaranteed because of how the ngrams were selected, is highlighted. Subsequently, the ngrams continue to rise, indicating inertia. The ngrams averaged in light gray were selected based on a twenty-year-long linear decrease. These also show inertia, this time in the downward direction. The effect is very pronounced. Although it cannot be deduced from this chart, thirty years after the highlighted decline, more than 90 percent of the ngrams have gone down further.
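The selection and alignment procedure described in this note can be sketched as follows. This is a minimal illustration, not the code used for the study: the function names, the tolerance used to decide what counts as a linear rise, and the normalization by each series' value at Year 0 are all assumptions.

```python
from statistics import median


def doubles_linearly(series, start, span=20, tol=0.25):
    """Return True if `series` rises roughly linearly and at least doubles
    over `span` years beginning at index `start`.

    Hypothetical selection rule: every point in the window must lie within
    a fraction `tol` of the straight line joining the window's endpoints.
    """
    window = series[start:start + span + 1]
    if len(window) < span + 1 or window[0] <= 0:
        return False
    first, last = window[0], window[-1]
    if last < 2 * first:
        return False
    for i, value in enumerate(window):
        expected = first + (last - first) * i / span
        if abs(value - expected) > tol * expected:
            return False
    return True


def aligned_median(series_list, starts, horizon=50):
    """Shift each series so its rise begins at Year 0, scale it by its
    Year 0 value (an assumption), and take the median at each offset."""
    trajectory = []
    for offset in range(horizon + 1):
        values = [
            s[st + offset] / s[st]
            for s, st in zip(series_list, starts)
            if st + offset < len(s)
        ]
        if not values:
            break
        trajectory.append(median(values))
    return trajectory


# Toy yearly frequency series (made-up numbers, not real ngram data).
toy = [[100 + 5 * t for t in range(61)], [50 + 3 * t for t in range(61)]]
selected = [s for s in toy if doubles_linearly(s, 0)]
print(aligned_median(selected, [0] * len(selected), horizon=30))
```

In this sketch, inertia would show up as a median trajectory that keeps climbing after Year 20. The same functions could be applied to declining ngrams by changing the selection rule.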
“物理学家比较了一系列相似的事实。”参见 Franz Boas,《地理学研究》,《科学》 210S(1887):137–41。
“The physicist compares a series of similar facts.” See Franz Boas, “The Study of Geography,” Science 210S (1887): 137–41.
本索引中的页码指的是本书的印刷版。如需查找本书电子版正文中的对应位置,请使用电子阅读器上的“搜索”功能。请注意,并非所有术语均可搜索。
The page numbers in this index refer to the printed version of this book. To find the corresponding locations in the text of this digital version, please use the “search” function on your e-reader. Note that not all terms may be searchable.
斜体页码表示图表。
Page numbers in italics indicate charts.
演员,名声,113,114
Actors, fame of, 113, 114
亚当斯,约翰,2
Adams, John, 2
非裔美国人的权利,151
African-Americans, rights of, 151
艾登,阿维娃,58,86,165,170,179
Aiden, Aviva, 58, 86, 165, 170, 179
爱德华·“巴兹”·奥尔德林,120,121,262 n
Aldrin, Edward “Buzz,” 120, 121, 262n
算法,84
Algorithms, 84
亚马逊,9,59,189
Amazon, 9, 59, 189
美国在线(AOL),61–63,249 n,270 n
America Online (AOL), 61–63, 249n, 270n
美国人类学协会,31
American Anthropological Association, 31
美国出版商协会,59
American Association of Publishers, 59
美国方言学会(ADS),76,259 n
American Dialect Society (ADS), 76, 259n
美国传统词典(AHD),68,70–71,256–58 nn
American Heritage Dictionary (AHD), 68, 70–71, 256–58nn
安德森,乔恩·李,157
Anderson, Jon Lee, 157
克里斯蒂安·安德沃德,98–101, 108, 172
Andvord, Kristian, 98–101, 108, 172
人类学,23,31,176,268 n
Anthropology, 23, 31, 176, 268n
反共产主义,美国,139–41
Anti-communism, U.S., 139–41
苹果
Apple
iPod,172
iPod, 172
iTunes,59
iTunes, 59
阿波马托克斯战役,2
Appomattox, Battle of, 2
阿贝斯曼,塞缪尔,247 n
Arbesman, Samuel, 247n
梵蒂冈档案馆,82, 259–60 n
Archivio Segreto Vaticano, 82, 259–60n
阿姆斯特朗,尼尔,120–21,262 n
Armstrong, Neil, 120–21, 262n
《邦联条例》,1
Articles of Confederation, 1
人工智能,58,174,249 n
Artificial intelligence, 58, 174, 249n
艺术家,名声,113,115
Artists, fame of, 113, 115
Artsy.net,193
Artsy.net, 193
阿西莫夫,艾萨克,208–10
Asimov, Isaac, 208–10
联想,记忆,155,159-60
Association, memory by, 155, 159–60
阿斯坦,弗雷德,103
Astaire, Fred, 103
天文学,7,22,29,113,182-83
Astronomy, 7, 22, 29, 113, 182–83
奥登,WH,254 n
Auden, W. H., 254n
奥斯维辛,148,149
Auschwitz, 148, 149
澳大利亚,190
Australia, 190
奥地利,152
Austria, 152
作家协会,59,260 n
Authors Guild, 59, 260n
艾尔斯,伦纳德,250 n
Ayres, Leonard, 250n
贝尔,哈罗德,Jr.,260 n
Baer, Harold, Jr., 260n
贝恩,亚历山大,167
Bain, Alexander, 167
鲍德温,亚历克,89,92
Baldwin, Alec, 89, 92
阿尔伯特·拉斯洛·巴拉巴西,13–14, 60
Barabási, Albert-László, 13–14, 60
巴巴罗,迈克尔,61–62
Barbaro, Michael, 61–62
巴伦,斯蒂芬妮,263 n
Barron, Stephanie, 263n
巴塞,纪尧姆,272 n
Basse, Guillaume, 272n
自由的战斗呐喊(麦克弗森),2–3
Battle Cry of Freedom (McPherson), 2–3
包豪斯运动,127,145
Bauhaus movement, 127, 145
拜耳,本,86
Bayer, Ben, 86
贝克曼,马克斯,132
Beckmann, Max, 132
北京大学,142
Beijing University, 142
贝尔,亚历山大·格雷厄姆,167–69
Bell, Alexander Graham, 167–69
本福德定律,266 n
Benford’s law, 266n
贝奥武甫,40,42
Beowulf, 40, 42
贝西·阿尔瓦,139
Bessie, Alvah, 139
比伯曼,赫伯特,139
Biberman, Herbert, 139
圣经,48,189,246 n
Bible, 48, 189, 246n
撒母耳记,185
Book of Samuel, 185
贾斯汀·比伯,88
Bieber, Justin, 88
生物学,7, 23, 26, 27, 31, 43, 49, 175, 176, 208
Biology, 7, 23, 26, 27, 31, 43, 49, 175, 176, 208
进化,32,43–44
evolution in, 32, 43–44
幂律,37
power laws in, 37
黑名单,140–41
Blacklisting, 140–41
博厄斯,弗朗兹,176,212,268 n
Boas, Franz, 176, 212, 268n
鲍勃利,布雷特,206,207
Bobley, Brett, 206, 207
博加特,汉弗莱,103,261 n
Bogart, Humphrey, 103, 261n
博汉农,约翰,261 n
Bohannon, John, 261n
布尔什维克,137,139
Bolsheviks, 137, 139
焚书,124,133–36,175,262 n,268 n
Book burnings, 124, 133–36, 175, 262n, 268n
书虫,178–80,270 n
Bookworm, 178–80, 270n
谷歌版本。参见Ngram Viewer
Google’s version of. See Ngram Viewer
博尔赫斯,豪尔赫·路易斯,103
Borges, Jorge Luis, 103
波士顿环球报,190
Boston Globe, 190
波士顿马拉松爆炸案,201–2,262 n,271 n
Boston Marathon bombing, 201–2, 262n, 271n
乔治·布拉克,128
Braque, Georges, 128
胸罩的发明,170
Brassiere, invention of, 170
巴西,131
Brazil, 131
布雷克,阿诺,129
Breker, Arno, 129
布里斯班,亚瑟,24–25,248 n
Brisbane, Arthur, 24–25, 248n
大英图书馆,191
British Library, 191
布罗克曼,威廉,180
Brockman, William, 180
布鲁克斯,范威克,106
Brooks, Van Wyck, 106
“布朗语料库”(Kucera 和 Francis),256 n
“Brown Corpus” (Kucera and Francis), 256n
欺凌,202
Bullying, 202
罗伯托·布萨, 48, 49, 253 nn , 254 n
Busa, Roberto, 48, 49, 253nn, 254n
乔治·W·布什,67
Bush, George W., 67
加州大学
California, University of
伯克利,95
Berkeley, 95
圣地亚哥,13
San Diego, 13
卡马乔, 马丁, 270 n , 272 n
Camacho, Martin, 270n, 272n
剑桥大学,96
Cambridge University, 96
卡彭,艾尔,103
Capone, Al, 103
卡片目录,82–83,265 n
Card catalogs, 82–83, 265n
卡罗尔,刘易斯,67,68
Carroll, Lewis, 67, 68
天主教会,205–6
Catholic Church, 205–6
因果关系,与相关性的区别,20
Causation, correlation versus, 20
手机,留下的数字痕迹,14,60
Cell phones, digital trail left by, 14, 60
玻璃纸的发明,171,172
Cellophane, invention of, 171, 172
细胞,6
Cells, 6
幂律,36
power laws of, 36
审查制度,137–49,179,190
Censorship, 137–49, 179, 190
反共产主义,在美国,139–41
anti-communist, in U.S., 139–41
中国,141–43
China, 141–43
检测,143–46
detection of, 143–46
纳粹,124、127–37、146–49
Nazi, 124, 127–37, 146–49
斯大林主义者,137–39
Stalinist, 137–39
美国疾病控制与预防中心(CDC),14
Centers for Disease Control and Prevention (CDC), 14
巴西银行文化中心,131
Centro Cultural Banco do Brasil, 131
夏加尔,马克,124–29,132–33,143,148–49,262 n,263 n
Chagall, Marc, 124–29, 132–33, 143, 148–49, 262n, 263n
查普曼,马克·戴维,118
Chapman, Mark David, 118
查尔斯王子,88
Charles, Prince, 88
乔叟,杰弗里,42
Chaucer, Geoffrey, 42
化学,7
Chemistry, 7
切蒂,拉吉,14–15
Chetty, Raj, 14–15
朱莉娅·查尔德(Julia Child),88
Child, Julia, 88
中国
China
审查制度,141–43,265 n
censorship in, 141–43, 265n
国家图书馆,16
National Library of, 16
瑞典王后克里斯蒂娜,260 n
Christina, Queen of Sweden, 260n
《圣诞颂歌》(狄更斯),93
Christmas Carol, A (Dickens), 93
卓柏卡布拉,54,254 n
Chupacabra, 54, 254n
丘吉尔,温斯顿,261 nn
Churchill, Winston, 261nn
民权运动,151
Civil rights movement, 151
美国内战,2–5,151
Civil War, U.S., 2–5, 151
克兰西,丹,86–87
Clancy, Dan, 86–87
克林顿,比尔,111–12,172,174
Clinton, Bill, 111–12, 172, 174
队列方法,100–102,108,172
Cohort method, 100–102, 108, 172
斯蒂芬·科尔伯特,86
Colbert, Stephen, 86
科尔,莱斯特,139
Cole, Lester, 139
科尔曼,玛丽苏,55,255 n
Coleman, Mary Sue, 55, 255n
集体记忆,153–54,157
Collective memory, 153–54, 157
吸收新信息,164–75
assimilation of new information into, 164–75
遗忘,157–64
forgetting and, 157–64
柯林斯,迈克尔,262 n
Collins, Michael, 262n
柯尔特,塞缪尔,171
Colt, Samuel, 171
苏联共产党第十三次代表大会,137
Communist Party (USSR), XIII Conference of, 137
复合词,258 n
Compound words, 258n
计算社会科学,207
Computational social science, 207
奥古斯特·孔德,210,212
Comte, Auguste, 210, 212
索引, 33, 47–49, 253 n
Concordances, 33, 47–49, 253n
康登,爱德华,250 n
Condon, Edward, 250n
美国国会,69,139
Congress, U.S., 69, 139
美国宪法,2,244 n
Constitution, U.S., 2, 244n
权利法案,151
Bill of Rights, 151
哥白尼,尼古拉,7,89
Copernicus, Nicolaus, 7, 89
大卫·科波菲尔,94
Copperfield, David, 94
版权法,188
Copyright laws, 188
违反, 59
violation of, 59
语料库语言学,207
Corpus linguistics, 207
相关性,与因果关系的区别,20
Correlation, causation versus, 20
克兰,格雷戈里,270 n
Crane, Gregory, 270n
众包,76
Crowdsourcing, 76
立体主义,127
Cubism, 127
文化观察站,188,270 n
Cultural Observatory, 188, 270n
文化组学,22–23,76,175,176,247 n,256 n,259 n,268–69 n
Culturomics, 22–23, 76, 175, 176, 247n, 256n, 259n, 268–69n
居里夫人,玛丽,261 n
Curie, Marie, 261n
网络欺凌,202
Cyberbullying, 202
捷克斯洛伐克,96
Czechoslovakia, 96
达达,127
Dada, 127
达利,萨尔瓦多,103
Dalí, Salvador, 103
道尔顿,玛格丽特·斯蒂格,135
Dalton, Margaret Stieg, 135
达特茅斯学院,207
Dartmouth College, 207
达尔文,查尔斯,20,30,177,251n , 261n
Darwin, Charles, 20, 30, 177, 251n, 261n
大卫,以色列王,185
David, King of Israel, 185
《大卫·科波菲尔》(狄更斯),93
David Copperfield (Dickens), 93
戴维斯,杰斐逊,3
Davis, Jefferson, 3
死海古卷,191
Dead Sea Scrolls, 191
《人权和公民权宣言》,151
Declaration of the Rights of Man and the Citizen, 151
“堕落艺术”和展览(Entartete Kunst),127–33, 146, 263–64 nn
“Degenerate art” and exhibit (Entartete Kunst), 127–33, 146, 263–64nn
邓小平,103
Deng Xiaoping, 103
齿科后缀,39–40
Dental suffix, 39–40
文化决定论,210–12
Determinism, cultural, 210–12
德国国家图书馆,16
Deutsche Nationalbibliothek, 16
德国学生会,133–34
Deutsche Studentenschaft, 133–34
杜威,戈弗雷,250 n
Dewey, Godfrey, 250n
杜威,梅尔维尔,250 n
Dewey, Melvil, 250n
戴安娜王妃,88
Diana, Princess, 88
迪亚兹,朱诺特,256 n
Díaz, Junot, 256n
查尔斯·狄更斯,92–95,117,119
Dickens, Charles, 92–95, 117, 119
狄金森,艾米莉,87–88,93,107
Dickinson, Emily, 87–88, 93, 107
字典,68–71、73–77、109–10、252 n、256 n、258–59 nn
Dictionaries, 68–71, 73–77, 109–10, 252n, 256n, 258–59nn
《英语词典》(约翰逊),68
Dictionary of the English Language, A (Johnson), 68
深入数据挑战,206
Digging into Data Challenge, 206
数字足迹,11,23
Digital footprint, 11, 23
DigitalGlobe,21
DigitalGlobe, 21
数字人文,48,207
Digital humanities, 48, 207
美国数字公共图书馆,270 n
Digital Public Library of America, 270n
恐龙化石,31–32
Dinosaur fossils, 31–32
华特·迪士尼,139
Disney, Walt, 139
德米特里克,爱德华,139
Dmytryk, Edward, 139
DNA,65,177
DNA, 65, 177
Dockwra,William,266–67 n
Dockwra, William, 266–67n
Dropbox,195
Dropbox, 195
杜尚,马塞尔,261 n
Duchamp, Marcel, 261n
唐恩都乐,178
Dunkin’ Donuts, 178
杜兰德,涅瓦·切尔尼亚夫斯基,270 n
Durand, Neva Cherniavsky, 270n
反乌托邦,204
Dystopias, 204
eBay,13,21,60
eBay, 13, 21, 60
艾宾浩斯,赫尔曼,154–57,159,161,171,172,176,266 n
Ebbinghaus, Hermann, 154–57, 159, 161, 171, 172, 176, 266n
电子书,187,189
E-books, 187, 189
经济学,13,25,210,222–23,269 n
Economics, 13, 25, 210, 222–23, 269n
埃德加·爱伦·坡博物馆(里士满),192
Edgar Allan Poe Museum (Richmond), 192
教育政策,14-15
Education policy, 14–15
爱因斯坦,阿尔伯特,20,116,135,178
Einstein, Albert, 20, 116, 135, 178
艾森豪威尔,德怀特,117
Eisenhower, Dwight, 117
埃尔德里奇,RC,249 n,250 n
Eldridge, R. C., 249n, 250n
选举结果预测,15
Election results, predicting, 15
美国选举团,15
Electoral College, U.S., 15
电子邮件,194–95
E-mail, 194–95
英国权利法案,151
English Bill of Rights, 151
英语语法,历史,17
English grammar, history of, 17
Entartete Kunst. See “Degenerate art” and exhibit
流行病,14,63,98–100
Epidemics, 14, 63, 98–100
恩斯特,马克斯,132
Ernst, Max, 132
让·巴蒂斯特·埃斯托,249–50 n
Estoup, Jean-Baptiste, 249–50n
埃塞俄比亚种族灭绝,118
Ethiopian genocide, 118
埃廷格,帕维尔,125–26
Ettinger, Pavel, 125–26
欧洲联盟,1
European Union, 1
欧洲数字图书馆,192
Europeana, 192
进化,20,177,178
Evolution, 20, 177, 178
生物学,32,43–44
biological, 32, 43–44
语言,31、32、37-36、41、49、178
of language, 31, 32, 37–36, 41, 49, 178
表现主义,127,264 n
Expressionism, 127, 264n
眼镜的发明,5
Eyeglasses, invention of, 5
Facebook, 9, 10, 13, 21, 36, 59, 193, 204
Facebook, 9, 10, 13, 21, 36, 59, 193, 204
上传图片至,195,202
uploading pictures to, 195, 202
交友习惯的变化,19
variations in friending practices on, 19
假阴性和假阳性,95
False negatives and positives, 95
名声,87–98,262 n
Fame, 87–98, 262n
职业选择,112–16
career choice and, 112–16
队列分析,101–12
cohort analysis of, 101–12
耻辱与,116–19
infamy versus, 116–19
传真机的发明,166–67
Fax machine, invention of, 166–67
联邦调查局(FBI),201
Federal Bureau of Investigation (FBI), 201
联邦贸易委员会,272 n
Federal Trade Commission, 272n
电影业,对共产主义影响的指控,139–41
Film industry, allegations of communist influence in, 139–41
Fitbit,198
Fitbit, 198
菲茨杰拉德,F.斯科特,135
Fitzgerald, F. Scott, 135
FiveThirtyEight 博客,15
FiveThirtyEight blog, 15
Flickr,10
Flickr, 10
流感流行,14,63
Flu epidemics, 14, 63
福布斯,251 n
Forbes, 251n
遗忘,155,157–64,173
Forgetting, 155, 157–64, 173
财富 500 强企业,55
Fortune 500 companies, 55
法律部门,59
legal departments of, 59
化石,31
Fossils, 31
语言学,32,37
linguistic, 32, 37
福斯特,约翰·W.,244 n
Foster, John W., 244n
《基地》(阿西莫夫),208,209
Foundation (Asimov), 208, 209
福勒,詹姆斯,13
Fowler, James, 13
法国,52,69,151,167,195
France, 52, 69, 151, 167, 195
夏加尔,125–26,132
Chagall in, 125–26, 132
弗朗西斯·W·纳尔逊,256 n
Francis, W. Nelson, 256n
弗兰克,安妮,148
Frank, Anne, 148
弗洛伊德,西格蒙德,117
Freud, Sigmund, 117
富勒,巴克敏斯特,267 n
Fuller, Buckminster, 267n
伽利略, 6–7, 82, 182–83, 188, 205–6, 208, 245 n , 248 n
Galileo, 6–7, 82, 182–83, 188, 205–6, 208, 245n, 248n
盖洛普民意调查,15,57,255 n
Gallup polls, 15, 57, 255n
四人帮,141
Gang of Four, 141
盖茨,比尔,174,251 n
Gates, Bill, 174, 251n
高更,保罗,128
Gauguin, Paul, 128
高斯,卡尔·弗里德里希,115–16
Gauss, Carl Friedrich, 115–16
广义相对论,20
General relativity theory, 20
埃塞俄比亚种族灭绝,118
Genocide, Ethiopian, 118
基因组测序,65,256 n
Genome sequencing, 65, 256n
德国,152
Germany, 152
纳粹。参见盖世太保;纳粹
Nazi. See Gestapo; Nazis
盖世太保,135
Gestapo, 135
吉利根,卡罗尔,96–97
Gilligan, Carol, 96–97
金斯伯格,杰里米,14,63
Ginsberg, Jeremy, 14, 63
公开性,138
Glasnost, 138
葛兰素史克,207-8
GlaxoSmithKline, 207–8
约瑟夫·戈培尔,127–29, 134–35, 262 n
Goebbels, Joseph, 127–29, 134–35, 262n
戈德温,塞缪尔,139–40
Goldwyn, Samuel, 139–40
谷歌,9,10,16,22,55,84–86,88,95,178,204
Google, 9, 10, 16, 22, 55, 84–86, 88, 95, 178, 204
中国审查制度,142,265 n,268 n
Chinese censorship of, 142, 265n, 268n
报纸档案数字化,190
digitization of newspaper archives, 190
地球,269 n
Earth, 269n
玻璃,197,295
Glass, 197, 295
图书馆峰会(2010),179
Library Summit (2010), 179
研究,86
Research, 86
趋势,14,63,248 n
Trends, 14, 63, 248n
Google 图书,16–18、22、25、33、50、55–60、62–66、81、83、86、190、255 n、259 n
Google Books, 16–18, 22, 25, 33, 50, 55–60, 62–66, 81, 83, 86, 190, 255n, 259n
和实体书籍的处理,188–90,247 n
and disposal of physical books, 188–90, 247n
HathiTrust 数字图书馆与,260 n
HathiTrust Digital Library and, 260n
基于语言语料库,258 n
linguistic corpus based on, 258n
另请参阅Ngram 查看器
See also Ngram Viewer
戈尔巴乔夫,米哈伊尔,138
Gorbachev, Mikhail, 138
美国政府印刷局,69
Government Printing Office, U.S., 69
格拉夫顿,安东尼,253–54 n
Grafton, Anthony, 253–54n
格兰特,尤利西斯·S.,3
Grant, Ulysses S., 3
格雷,伊莱沙,169
Gray, Elisha, 169
格雷,马修,66,180
Gray, Matthew, 66, 180
《远大前程》(狄更斯),93
Great Expectations (Dickens), 93
大清洗,137,138
Great Purge, 137, 138
希腊,古,81,191,195,270 n
Greece, ancient, 81, 191, 195, 270n
格罗皮乌斯,沃尔特,145
Gropius, Walter, 145
冈瑟,彼得,263 n
Guenther, Peter, 263n
半衰期,44–45,112,163,253n
Half-life, 44–45, 112, 163, 253n
汉利,迈尔斯 L.,33,37,42,48,50,249 n,250 n
Hanley, Miles L., 33, 37, 42, 48, 50, 249n, 250n
哈里斯,马尔科姆,204
Harris, Malcolm, 204
哈里森,本杰明,244 n
Harrison, Benjamin, 244n
哈佛大学, 8, 14, 32, 42, 46, 96, 106, 192, 253 n , 272 n
Harvard University, 8, 14, 32, 42, 46, 96, 106, 192, 253n, 272n
选择退出 Google 图书,247 n
opting out of Google Books, 247n
Program for Evolutionary Dynamics, 28, 176
怀德纳图书馆,16,17,247 n
Widener Library, 16, 17, 247n
标签,69
Hashtags, 69
HathiTrust 数字图书馆,188,260 n
HathiTrust Digital Library, 188, 260n
艺术之家(慕尼黑),129, 130
Haus der Kunst (Munich), 129, 130
哈维尔,瓦茨拉夫,95–97
Havel, Václav, 95–97
希伯来圣经,149,246 n
Hebrew Bible, 149, 246n
索引,48
concordances, 48
身高变化,34–35
Height, variations in, 34–35
海涅,海因里希,122,135
Heine, Heinrich, 122, 135
海明威,欧内斯特,103
Hemingway, Ernest, 103
亨利八世,英格兰国王,260 n
Henry VIII, King of England, 260n
赫尔德,约翰·戈特弗里德,175,268 n
Herder, Johann Gottfried, 175, 268n
赫尔曼,沃尔夫冈,134–36
Herrmann, Wolfgang, 134–36
希格斯玻色子,193
Higgs boson, 193
希尔顿,帕丽斯,89
Hilton, Paris, 89
历史趋势,预测,210-12
Historical trends, prediction of, 210–12
希区柯克,阿尔弗雷德,103
Hitchcock, Alfred, 103
希特勒,阿道夫,117–19、127、128、133、134、147
Hitler, Adolf, 117–19, 127, 128, 133, 134, 147
霍加斯,凯瑟琳,94
Hogarth, Catherine, 94
好莱坞十人,139–40
Hollywood Ten, 139–40
大屠杀,145
Holocaust, 145
胡克,罗伯特,6,245 n
Hooke, Robert, 6, 245n
霍顿·米夫林,256 n
Houghton Mifflin, 256n
美国众议院,267–68 n
House of Representatives, U.S., 267–68n
非美活动委员会 (HUAC),139, 140
Un-American Activities Committee (HUAC), 139, 140
胡耀邦,141
Hu Yaobang, 141
赫尔、科德尔、102、106
Hull, Cordell, 102, 106
人类行为与最小努力原则(Zipf),250 n
Human Behavior and the Principle of Least Effort (Zipf), 250n
人类基因组计划,85,193
Human Genome Project, 85, 193
人类言语组计划, 196
Human Speechome Project, 196
人文学科,48,206-8
Humanities, 48, 206–8
匈牙利,149
Hungary, 149
无假设研究,20
Hypothesis-free research, 20
IBM,48,253 n
IBM, 48, 253n
托米斯提库斯索引(布萨),48,253 n
Index Thomisticus (Busa), 48, 253n
文化惯性,210–11
Inertia, cultural, 210–11
恶名,116–19
Infamy, 116–19
流感,14,63
Influenza, 14, 63
信息压制。参见审查制度
Information, suppression of. See Censorship
信息时代,168
Information age, 168
宗教裁判所,205
Inquisition, 205
Instagram,195
Instagram, 195
英特尔国际科学与工程大奖赛,101
Intel International Science and Engineering Fair, 101
知识分子,纳粹反对,135,136
Intellectuals, Nazi campaign against, 135, 136
美国国税局(IRS),14,21
Internal Revenue Service (IRS), 14, 21
国际数据公司(IDC),246
International Data Corporation (IDC), 246
互联网, 10–11, 18, 55, 70, 72, 135, 165 , 181, 275 n
Internet, 10–11, 18, 55, 70, 72, 135, 165, 181, 275n
名人和, 97, 101
celebrities and, 97, 101
中国审查制度,142,265 n
Chinese censorship of, 142, 265n
缺乏访问途径,195
lack of access to, 195
重新定义的隐私规范,198
privacy norms redefined by, 198
搜索引擎,49.另见Google
search engines on, 49. See also Google
互联网档案馆,188
Internet Archive, 188
发明,165–75,268 n
Inventions, 165–75, 268n
伊朗,192
Iran, 192
不规则动词,37–47, 50–53, 66, 189, 251–53 nn
Irregular verbs, 37–47, 50–53, 66, 189, 251–53nn
半衰期,112,253 n
half-life of, 112, 253n
长期生存,155
long-term survival of, 155
的 ngram,178
ngrams of, 178
意大利,168,251 n
Italy, 168, 251n
iTunes,59
iTunes, 59
《Jabberwocky》(卡罗尔),67,68
“Jabberwocky” (Carroll), 67, 68
杰克逊,乔,42
Jackson, Joe, 42
提花织机,发明,172
Jacquard loom, invention of, 172
詹姆斯·威廉,157,266 n,271 n
James, William, 157, 266n, 271n
牛仔裤的发明,171,172
Jeans, invention of, 171, 172
耶路撒冷之窗(夏加尔),125,146
Jerusalem Windows (Chagall), 125, 146
耶稣会士,48
Jesuits, 48
耶稣,89,92
Jesus, 89, 92
犹太人,125
Jews, 125
纳粹迫害,122、124、127、128、130、133、135、145、147、149、265 n
Nazi persecution of, 122, 124, 127, 128, 130, 133, 135, 145, 147, 149, 265n
乔克斯,马修,206,207
Jockers, Matthew, 206, 207
约翰·贝茨·克拉克奖章,13,15
John Bates Clark Medal, 13, 15
约翰逊,塞缪尔,68,258 n
Johnson, Samuel, 68, 258n
乔斯·马丁,249 n,250 n
Joos, Martin, 249n, 250n
乔丹,迈克尔·I.,95
Jordan, Michael I., 95
官方公报,69
Journal Officiel, 69
乔伊斯·詹姆斯,33,34
Joyce, James, 33, 34
卡利尔,汤姆,207
Kalil, Tom, 207
加米涅夫,列夫,137,138,139
Kamenev, Lev, 137, 138, 139
康定斯基,瓦西里,128
Kandinsky, Wassily, 128
卡戴珊,金,89
Kardashian, Kim, 89
卡赞,埃利亚,66
Kazan, Elia, 66
凯勒,海伦,122–24,135,146,148,262 n
Keller, Helen, 122–24, 135, 146, 148, 262n
赫鲁晓夫,尼基塔,138
Khrushchev, Nikita, 138
Kindle电子书阅读器,189,253 n
Kindle e-book reader, 189, 253n
基希纳,恩斯特·路德维希,132
Kirchner, Ernst Ludwig, 132
克利,保罗,132
Klee, Paul, 132
小林健, 89, 92
Kobayashi, Takeru, 89, 92
库布里克,斯坦利,161
Kubrick, Stanley, 161
库切拉,亨利,256 n
Kucera, Henry, 256n
库兹韦尔,雷,174
Kurzweil, Ray, 174
Lander-Waterman statistics, 256n
语言,67
Language, 67
进化,31,32,37–36,49
evolution of, 31, 32, 37–36, 49
书面,9
written, 9
《语言本能》(平克),88
Language Instinct, The (Pinker), 88
拉德纳,林格,Jr.,139
Lardner, Ring, Jr., 139
大型强子对撞机,193
Large Hadron Collider, 193
劳森,约翰·霍华德,139
Lawson, John Howard, 139
学习,155–56
Learning, 155–56
集体,165–75
collective, 165–75
李,罗伯特·E.,3,5
Lee, Robert E., 3, 5
安东尼·范·列文虎克,245 n
Leeuwenhoek, Antonie van, 245n
传奇的、词汇的、饶舌的爱情(赖默),26–28、33、48、64
Legendary, Lexical, Loquacious Love (Reimer), 26–28, 33, 48, 64
Lemelson-MIT奖,174
Lemelson-MIT Prize, 174
列宁,弗拉基米尔,117,137
Lenin, Vladimir, 117, 137
约翰·列侬,89,92,118
Lennon, John, 89, 92, 118
镜头,5-7
Lenses, 5–7
列文,约翰,13,21,60
Levin, John, 13, 21, 60
词典学,68–76
Lexicography, 68–76
Zipfian,72–75,77,78
Zipfian, 72–75, 77, 78
图书馆,15–17,82–84,123,179–80,189
Libraries, 15–17, 82–84, 123, 179–80, 189
亚历山大,56
Alexandria, 56
书籍处理,189–90
book disposal, 189–90
卡片目录,82–83,265 n
card catalogs, 82–83, 265n
数字, 15–17, 50, 55–56, 83, 188, 189
digital, 15–17, 50, 55–56, 83, 188, 189
纳粹德国,123、133–35
Nazi Germany, 123, 133–35
物理,55,56,82,83,84,189,247 n,265 n
physical, 55, 56, 82, 83, 84, 189, 247n, 265n
另请参阅特定库
See also specific libraries
美国国会图书馆,17
Library of Congress, U.S., 17
《生活?还是戏剧?》(萨洛蒙),148
Life? or Theatre? (Salomon), 148
生活记录,196–201
Life logging, 196–201
获取风险,203-5
risks of access to, 203–5
林尤里,187
Lin, Yuri, 187
林肯,亚伯拉罕,118–19,167
Lincoln, Abraham, 118–19, 167
林道尔,西格蒙德,170
Lindauer, Sigmund, 170
LinkedIn,9
LinkedIn, 9
Linux,249 n
Linux, 249n
劳合·乔治,大卫,190
Lloyd George, David, 190
尼斯湖水怪,54
Loch Ness Monster, 54
洛克,约翰,150
Locke, John, 150
洛杉矶郡立艺术博物馆,263 n
Los Angeles County Museum of Art, 263n
洛杉矶时报,52
Los Angeles Times, 52
洛厄尔,珀西瓦尔,182,183–84,269 n
Lowell, Percival, 182, 183–84, 269n
卢西塔尼亚号(远洋客轮),158,159
Lusitania (ocean liner), 158, 159
马丁·路德,133,260 n
Luther, Martin, 133, 260n
马斯,赫尔曼,145
Maas, Hermann, 145
马尔茨,阿尔伯特,139
Maltz, Albert, 139
曼哈顿计划,103
Manhattan Project, 103
手稿,191
Manuscripts, 191
制图,21
Mapping, 21
水手探测器,182,184,269 n
Mariner probes, 182, 184, 269n
马克思,卡尔,117,135
Marx, Karl, 117, 135
马索拉,48
Masorah, 48
麻省理工学院(MIT),88,174
Massachusetts Institute of Technology (MIT), 88, 174
媒体实验室,认知机器组,196
Media Lab, Cognitive Machines Group, 196
大规模开放在线课程(MOOC),58
Massive open online courses (MOOCs), 58
数学家的名声,113,115
Mathematicians, fame of, 113, 115
亨利·马蒂斯,128
Matisse, Henri, 128
梅耶,路易斯·B.,139–40
Mayer, Louis B., 139–40
梅耶尔,玛丽莎,55–57
Mayer, Marissa, 55–57
麦卡锡,约瑟夫,140
McCarthy, Joseph, 140
麦克杜格尔,达蒙,243
McDougall, Damon, 243
麦格劳-希尔,59
McGraw-Hill, 59
麦克弗森,詹姆斯,2–3,5,244 n
McPherson, James, 2–3, 5, 244n
记忆,154–57
Memory, 154–57
集体,153–54,157–64
collective, 153–54, 157–64
数字化, 个人, 195
digital, of individuals, 195
梅卡德,拉蒙,137
Mercader, Ramón, 137
韦氏在线词典,259 n
Merriam-Webster’s online dictionary, 259n
安东尼奥·梅乌奇,168–69,268 n
Meucci, Antonio, 168–69, 268n
墨西哥,138
Mexico, 138
密歇根大学,55,255 n
Michigan, University of, 55, 255n
《显微图谱》(胡克),245 n
Micrographia (Hooke), 245n
显微镜,6–7,245 n
Microscopes, 6–7, 245n
微软,31,59
Microsoft, 31, 59
中古英语,42–43
Middle English, 42–43
思维记录,204
Mind logging, 204
现代艺术,纳粹镇压,127–33,146
Modern art, Nazi suppression of, 127–33, 146
蒙德里安,皮特,128
Mondrian, Piet, 128
摩尔定律,174
Moore’s law, 174
莫雷蒂,弗朗哥,206
Moretti, Franco, 206
《琼斯母亲》,181
Mother Jones, 181
爱德华·蒙克,128
Munch, Edvard, 128
芒罗,兰德尔,243
Munroe, Randall, 243
著名凶手,118
Murderers, famous, 118
纽约现代艺术博物馆,128,131
Museum of Modern Art (New York), 128, 131
马斯克,伊隆,267 n
Musk, Elon, 267n
墨索里尼,贝尼托,117,118
Mussolini, Benito, 117, 118
相互确保摧毁(MAD),174
Mutually Assured Destruction (MAD), 174
内森·梅尔沃德(Nathan Myhrvold),31
Myhrvold, Nathan, 31
纳吉马巴迪,阿夫萨内,192
Najmabadi, Afsaneh, 192
NASA(美国国家航空航天局),184
NASA (National Aeronautics and Space Administration), 184
国家人文基金会(NEH),190,191,193,206-8
National Endowment for the Humanities (NEH), 190, 191, 193, 206–8
美国国立卫生研究院(NIH),207,208
National Institutes of Health (NIH), 207, 208
美国国家医学图书馆,207
National Library of Medicine, 207
国家技术奖章,174
National Medal of Technology, 174
美国国家科学基金会,196
National Science Foundation, 196
自然选择,进化,20,41-44
Natural selection, evolution by, 20, 41–44
自然,13
Nature, 13
纳粹,122–24, 127–37, 264 n
Nazis, 122–24, 127–37, 264n
焚书,124,133–36,175,262 n
book burnings, 124, 133–36, 175, 262n
现代艺术被谴责和摧毁,127–33,263 nn,265 n
modern art denounced and destroyed by, 127–33, 263nn, 265n
内布拉斯加大学,206
Nebraska, University of, 206
新词,67
Neologisms, 67
巴勃罗·聂鲁达,102–3
Neruda, Pablo, 102–3
纽约时报,40,61,123,181,190,256 n
New York Times, 40, 61, 123, 181, 190, 256n
纽约大学,96
New York University, 96
《纽约客》,157
New Yorker, 157
报纸数字化,190。另见特定报纸
Newspapers, digitization of, 190. See also specific newspapers
n-gram,65
n-gram, 65
Ngram 查看器、23、180–81、187–88、243、269 n、270 n
Ngram Viewer, 23, 180–81, 187–88, 243, 269n, 270n
尼克雷,马修,270 n
Nicklay, Matthew, 270n
Nike+ FuelBand,198
Nike+ FuelBand, 198
1984(奥威尔),160
1984 (Orwell), 160
诺贝尔奖,13,102
Nobel Prize, 13, 102
诺尔德,埃米尔,132
Nolde, Emil, 132
Nordau,Max,127,263 n
Nordau, Max, 127, 263n
正态分布,35
Normal distribution, 35
东北大学,13,196
Northeastern University, 13, 196
彼得·诺维格,58, 60, 62, 64, 66, 86
Norvig, Peter, 58, 60, 62, 64, 66, 86
挪威,98–100
Norway, 98–100
诺瓦克,马丁,28,179,248 n
Nowak, Martin, 28, 179, 248n
诺瓦克,塞巴斯蒂安,179
Nowak, Sebastian, 179
奥巴马,巴拉克,15,204,207
Obama, Barack, 15, 204, 207
占领华尔街运动,204
Occupy Wall Street movement, 204
古英语,42
Old English, 42
雾都孤儿(狄更斯),93
Oliver Twist (Dickens), 93
奥本海默,罗伯特,103
Oppenheimer, Robert, 103
奥尼茨,塞缪尔,139
Ornitz, Samuel, 139
乔恩·奥尔万特(Jon Orwant),66,180
Orwant, Jon, 66, 180
奥威尔,乔治,160
Orwell, George, 160
牛津英语词典(OED),68–69,74,76,77,257 n,259 n
Oxford English Dictionary (OED), 68–69, 74, 76, 77, 257n, 259n
牛津大学,16
Oxford University, 16
牛津大学出版社,257 n
Oxford University Press, 257n
佩奇,拉里,16,55–57
Page, Larry, 16, 55–57
佩林,莎拉,67,256 n
Palin, Sarah, 67, 256n
帕累托,维尔弗雷多,251 n
Pareto, Vilfredo, 251n
帕森斯,雷塔赫,202,271 n
Parsons, Rehtaeh, 202, 271n
帕特尔,鲁帕尔,196
Patel, Rupal, 196
专利,167–70
Patents, 167–70
珍珠港,日本袭击,157–58,165
Pearl Harbor, Japanese attack on, 157–58, 165
培生教育,59
Pearson Education, 59
企鹅美国,59
Penguin USA, 59
彭尼贝克,詹姆斯,207
Pennebaker, James, 207
便士邮报,267 n
Penny Post, 267n
改革,138
Perestroika, 138
珀尔修斯图书馆项目,270 n
Perseus Library Project, 270n
斯拉夫·彼得罗夫,187
Petrov, Slav, 187
摄影,202-3
Photography, 202–3
物理学,7,37,49,109
Physics, 7, 37, 49, 109
毕加索、巴勃罗,128、145、263 n
Picasso, Pablo, 128, 145, 263n
皮克特,约瑟夫·P.,256 n,258 n
Pickett, Joseph P., 256n, 258n
《匹克威克外传》(狄更斯),92,94
Pickwick Papers, The (Dickens), 92, 94
Pinker,Steven,86–88,96,179,256 n,266 n
Pinker, Steven, 86–88, 96, 179, 256n, 266n
效忠誓言,2,204
Pledge of Allegiance, 2, 204
埃德加·爱伦·坡,186,188,190–92,194,195
Poe, Edgar Allan, 186, 188, 190–92, 194, 195
政客的名声,113,115
Politicians, fame of, 113, 115
波特伍德,奈杰尔,257 n
Portwood, Nigel, 257n
幂律,35–37,251 n,266 n
Power laws, 35–37, 251n, 266n
价格理论,13
Prices, theory of, 13
印刷油墨,248 n
Printers’ Ink, 248n
隐私
Privacy
互联网与规范的重新定义,198
Internet and redefinition of norms of, 198
生命记录技术的潜在影响,203–5
potential impacts of life-logging technology on, 203–5
古腾堡计划,188,270 n
Project Gutenberg, 188, 270n
宣传,129
Propaganda, 129
纳粹,124,127,130,143,144
Nazi, 124, 127, 130, 143, 144
原始日耳曼语,39
Proto-Germanic languages, 39
原始印欧语系,39
Proto-Indo-European languages, 39
普鲁斯特,马塞尔,102
Proust, Marcel, 102
语言心理生物学(Zipf),249 n,254 n
Psychobiology of Language, The (Zipf), 249n, 254n
心理史学,208–12
Psychohistory, 208–12
心理语言学,155–57
Psycholinguistics, 155–57
托勒密的地球中心宇宙观,7
Ptolemaic notion of Earth-centric universe, 7
普利策奖,3,256 n
Pulitzer Prize, 3, 256n
Python,243
Python, 243
量子力学,7
Quantum mechanics, 7
丹·奎尔,67
Quayle, Dan, 67
种族主义,175–76
Racism, 175–76
放射性理论,44
Radioactivity, theory of, 44
瑞利散射,248 n
Rayleigh scattering, 248n
罗纳德·里根,117,139
Reagan, Ronald, 117, 139
红色恐慌,66
Red Scare, 66
雷德福,罗伯特,95–96
Redford, Robert, 95–96
Reichskulturkammer(帝国文化室),127–28
Reichskulturkammer (Reich Culture Chamber), 127–28
Reimer,Karen,26–28,48,64,253 n
Reimer, Karen, 26–28, 48, 64, 253n
宗教,科学革命的影响,7
Religion, impact of scientific revolution on, 7
共和党全国委员会,95
Republican National Committee, 95
Revolver, invention of, 171–72
“富人越来越富”的过程,26
“Rich get richer” process, 26
基本权利,150–51
Rights, fundamental, 150–51
罗克莫尔,丹尼尔,207
Rockmore, Daniel, 207
罗马帝国,150
Roman Empire, 150
罗姆尼,米特,15
Romney, Mitt, 15
罗斯福,富兰克林·德拉诺,102,117
Roosevelt, Franklin Delano, 102, 117
罗斯福,西奥多,69,117,257 n
Roosevelt, Theodore, 69, 117, 257n
罗伊,黛布,196,197
Roy, Deb, 196, 197
罗伊,德韦恩,196–98,203
Roy, Dwayne, 196–98, 203
鲁宾,托马斯,59
Rubin, Thomas, 59
罗素,伯特兰,261 n
Russell, Bertrand, 261n
罗素,亨利·诺里斯,183
Russell, Henry Norris, 183
俄罗斯国家图书馆,16
Russia, National Library of, 16
俄国革命,125,137,138
Russian Revolution, 125, 137, 138
卢瑟福,欧内斯特,102
Rutherford, Ernest, 102
所罗门,夏洛特,147–48
Salomon, Charlotte, 147–48
大脚怪,53–54
Sasquatch, 53–54
斯卡利亚,安东尼,256 n
Scalia, Antonin, 256n
斯堪杜侦察兵,198
Scanadu Scout, 198
乔瓦尼·斯基亚帕雷利,183
Schiaparelli, Giovanni, 183
施曼特-贝塞拉特,丹尼斯,245 n
Schmandt-Besserat, Denise, 245n
施密特,本杰明,270 n
Schmidt, Benjamin, 270n
科学(期刊),180,250 n,261 n
Science (journal), 180, 250n, 261n
科学,206-8。另见具体科学
Sciences, 206–8. See also specific sciences
科学方法,19-20
Scientific method, 19–20
科学革命,7
Scientific revolution, 7
科学家的名声,113,115
Scientists, fame of, 113, 115
观测镜,6–8
Scopes, 6–8
斯科特·阿德里安,139
Scott, Adrian, 139
尖叫(蒙克),128
Scream, The (Munch), 128
西尔斯·大卫,207–8
Searls, David, 207–8
莎士比亚,威廉,31,67,206,254 n
Shakespeare, William, 31, 67, 206, 254n
香农,克劳德,261 n
Shannon, Claude, 261n
绵羊,数数,8-9
Sheep, counting, 8–9
沈元,66,81,84,176–77
Shen, Yuan, 66, 81, 84, 176–77
谢里丹,菲利普,3
Sheridan, Philip, 3
谢尔曼,威廉·特库姆塞,3,89
Sherman, William Tecumseh, 3, 89
Shortz,Will,256 n
Shortz, Will, 256n
西尔弗,内特,15
Silver, Nate, 15
西蒙与舒斯特出版社,59
Simon & Schuster, 59
简化拼写委员会,257 n
Simplified Spelling Board, 257n
奇点,技术,174
Singularity, technological, 174
Skype,9
Skype, 9
Slipher,EC,269 n
Slipher, E. C., 269n
Snapchat,205,272 n
Snapchat, 205, 272n
斯诺登,爱德华,204
Snowden, Edward, 204
社会科学,207,209-10
Social science, 207, 209–10
索尼随身听,172
Sony Walkman, 172
太空竞赛,120
Space race, 120
斯普尼克号,120
Sputnik, 120
约瑟夫·斯大林,117, 118, 137–39, 264 n
Stalin, Joseph, 117, 118, 137–39, 264n
斯坦福大学,13,58,206,247 n
Stanford University, 13, 58, 206, 247n
数字图书馆技术项目,16
Digital Library Technologies Project, 16
斯蒂尔,迈克尔,94
Steele, Michael, 94
斯图尔特·波特,89
Stewart, Potter, 89
施特劳斯,列维,171,172
Strauss, Levi, 171, 172
斯特拉特,约翰,248 n
Strutt, John, 248n
美国最高法院,256 n
Supreme Court, U.S., 256n
监视,202,205
Surveillance, 202, 205
《双城记》(狄更斯),93,95
Tale of Two Cities, A (Dickens), 93, 95
唐蒂娜(Tang, Tina),42
Tang, Tina, 42
Target商店,204
Target stores, 204
教师效能,14-15
Teacher effectiveness, 14–15
电信行业,168
Telecommunications industry, 168
电话的发明,167–70
Telephone, invention of, 167–70
望远镜,6–7,245 n
Telescopes, 6–7, 245n
德克萨斯大学奥斯汀分校,191,207
Texas, University of, at Austin, 191, 207
文本,未发表,191–92
Texts, unpublished, 191–92
特雷门琴的发明,172,268 n
Theremin, invention of, 172, 268n
托马斯·阿奎那,48
Thomas Aquinas, 48
塞巴斯蒂安·特伦(Sebastian Thrun),58
Thrun, Sebastian, 58
查尔斯·瑟伯,170
Thurber, Charles, 170
天安门广场大屠杀,141–43
Tiananmen Square massacre, 141–43
《时代》杂志,84、87、97
Time magazine, 84, 87, 97
列夫·托尔斯泰,95
Tolstoy, Leo, 95
托洛茨基,莱昂,137–38
Trotsky, Leon, 137–38
Trove项目,190
Trove project, 190
杜鲁门,哈里,140
Truman, Harry, 140
特伦博,道尔顿,139,141
Trumbo, Dalton, 139, 141
结核病,98–101
Tuberculosis, 98–101
马克·吐温,102,106,116,254 n
Twain, Mark, 102, 106, 116, 254n
推特, 10, 36, 59, 69, 181, 195, 204
Twitter, 10, 36, 59, 69, 181, 195, 204
打字机的发明,170,172
Typewriter, invention of, 170, 172
霸王龙,31–32
Tyrannosaurus rex, 31–32
Ulam,Stanislaw,173,268 n
Ulam, Stanislaw, 173, 268n
厄尔曼,米查,262 n
Ullman, Micha, 262n
《尤利西斯》(乔伊斯),33,34,37,249 n
Ulysses (Joyce), 33, 34, 37, 249n
苏维埃社会主义共和国联盟(苏联),120,137-38
Union of Soviet Socialist Republics (USSR), 120, 137–38
联合国,102
United Nations, 102
乌普萨拉大学,245 n
Uppsala University, 245n
Urbandictionary.com,76
Urbandictionary.com, 76
梵高,文森特,88–89
Van Gogh, Vincent, 88–89
范德普拉斯,杰克,243
VanderPlas, Jake, 243
瓦里安,哈尔,266 n
Varian, Hal, 266n
梵蒂冈秘密档案馆,82,259–60 n
Vatican Secret Archive, 82, 259–60n
天鹅绒革命,96
Velvet Revolution, 96
阿德里安·韦雷斯,101–2, 261 nn
Veres, Adrian, 101–2, 261nn
维也纳圈,152–54
Vienna Circle, 152–54
约翰·冯·诺依曼,173–74,268 n
Von Neumann, John, 173–74, 268n
选民参与度,13
Voter participation, 13
瓦格纳,理查德,117
Wagner, Richard, 117
《战争与和平》(托尔斯泰),95
War and Peace (Tolstoy), 95
《世界大战》(威尔斯),184
War of the Worlds, The (Wells), 184
安迪·沃霍尔,108
Warhol, Andy, 108
华盛顿邮报,2–3,52
Washington Post, 2–3, 52
威尔斯,HG,135,184
Wells, H. G., 135, 184
西部电气制造公司,169
Western Electric Manufacturing Company, 169
怀特,威廉·艾伦,106
White, William Allen, 106
白宫科技政策办公室,207
White House Office of Science and Technology Policy, 207
维基百科,146、170、247 n、248 n、261 n
Wikipedia, 146, 170, 247n, 248n, 261n
Wiktionary.com,76
Wiktionary.com, 76
威利,约翰,出版公司,59
Wiley, John, publishing company, 59
威廉王子,88岁
William, Prince, 88
女性,146,151,192
Women, 146, 151, 192
作为人类计算机,249 n
as human computers, 249n
伍尔夫,弗吉尼亚,261 n
Woolf, Virginia, 261n
词频,27–28,32–37,65,249 n
Word frequencies, 27–28, 32–37, 65, 249n
名望相关性,110–12
correlation of fame and, 110–12
不规则动词,41,43,44
of irregular verbs, 41, 43, 44
詹姆斯·乔伊斯《尤利西斯》词汇索引(汉利出版社),33、42、47、48、50
Word Index to James Joyce’s Ulysses (Hanley), 33, 42, 47, 48, 50
Wordnik.com,76
Wordnik.com, 76
词语,识别,68–72
Words, identifying, 68–72
世界贸易中心,恐怖袭击(9/11),157–58
World Trade Center, terrorist attack on (9/11), 157–58
第一次世界大战,159,160
World War I, 159, 160
第二次世界大战,102,139,144,147,158,159,160
World War II, 102, 139, 144, 147, 158, 159, 160
万维网,16,59,62,193。另请参阅互联网
World Wide Web, 16, 59, 62, 193. See also Internet
赖特,奥维尔,90–92,98,102,109,117
Wright, Orville, 90–92, 98, 102, 109, 117
赖特,威尔伯,90–92,98,117
Wright, Wilbur, 90–92, 98, 117
作家,名声,113,114
Writers, fame of, 113, 114
冯特,威廉,266 n
Wundt, Wilhelm, 266n
亚德瓦谢姆(耶路撒冷),145
Yad Vashem (Jerusalem), 145
雅虎,55
Yahoo!, 55
杨临峰,270n
Yang, Linfeng, 270n
YouTube,10,88,195
YouTube, 10, 88, 195
汤姆·泽勒,小,61–62
Zeller, Tom, Jr., 61–62
周恩来,141
Zhou Enlai, 141
齐格勒,阿道夫,128,129
Ziegler, Adolf, 128, 129
季诺维也夫,格里戈里,137,138
Zinoviev, Grigory, 137, 138
Zipf,George Kingsley,32–38,49–50,53,71–75,78,99,189,249–51 nn,254 n,272 n
Zipf, George Kingsley, 32–38, 49–50, 53, 71–75, 78, 99, 189, 249–51nn, 254n, 272n